This code was using get_user_pages*(), in approximately a "Case 5"
scenario (accessing the data within a page), using the categorization
from [1]. That means that it's time to convert the get_user_pages*() +
put_page() calls to pin_user_pages*() + unpin_user_pages() calls.
There is some helpful background in [2]: basically, this is a small part
of fixing a long-standing disconnect between pinning pages and file
systems' use of those pages.
[1] Documentation/core-api/pin_user_pages.rst
[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Link: http://lkml.kernel.org/r/20200529234309.484480-3-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Hubbard [Mon, 8 Jun 2020 04:41:11 +0000 (21:41 -0700)]
docs: mm/gup: pin_user_pages.rst: add a "case 5"
Patch series "vhost, docs: convert to pin_user_pages(), new "case 5""
It recently became clear to me that there are some get_user_pages*()
callers that don't fit neatly into any of the four cases that are so far
listed in pin_user_pages.rst. vhost.c is one of those.
Add a Case 5 to the documentation, and refer to that when converting
vhost.c.
Thanks to Jan Kara for helping me (again) in understanding the
interaction between get_user_pages() and page writeback [1].
This is based on today's mmotm, which has a nearby patch to
pin_user_pages.rst that rewords cases 3 and 4.
Note that I have only compile-tested the vhost.c patch, although that
does also include cross-compiling for a few other arches. Any run-time
testing would be greatly appreciated.
There are four cases listed in pin_user_pages.rst. These are intended
to help developers figure out whether to use get_user_pages*(), or
pin_user_pages*(). However, the four cases do not cover all the
situations. For example, drivers/vhost/vhost.c has a "pin, write to
page, set page dirty, unpin" case.
Add a fifth case, to help explain that there is a general pattern that
requires pin_user_pages*() API calls.
John Hubbard [Mon, 8 Jun 2020 04:41:08 +0000 (21:41 -0700)]
mm/gup: documentation fix for pin_user_pages*() APIs
All of the pin_user_pages*() API calls will cause pages to be
dma-pinned. As such, they are all suitable for DMA, RDMA, and/or
Direct IO.
The documentation should say so, but it was instead saying that three of
the API calls were only suitable for Direct IO. This was discovered
when a reviewer wondered why an API call that specifically recommended
against Case 2 (DMA/RDMA) was being used in a DMA situation [1].
Fix this by simply deleting those claims. The gup.c comments already
refer to the more extensive Documentation/core-api/pin_user_pages.rst,
which does have the correct guidance. So let's just write it once,
there.
This code was using get_user_pages*(), and all of the callers so far
were in a "Case 2" scenario (DMA/RDMA), using the categorization from [1].
That means that it's time to convert the get_user_pages*() + put_page()
calls to pin_user_pages*() + unpin_user_pages() calls.
There is some helpful background in [2]: basically, this is a small part
of fixing a long-standing disconnect between pinning pages and file
systems' use of those pages.
[1] Documentation/core-api/pin_user_pages.rst
[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: Jérôme Glisse <jglisse@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200527223243.884385-3-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Hubbard [Mon, 8 Jun 2020 04:41:02 +0000 (21:41 -0700)]
mm/gup: introduce pin_user_pages_locked()
Patch series "mm/gup: introduce pin_user_pages_locked(), use it in frame_vector.c", v2.
This adds yet one more pin_user_pages*() variant, and uses that to
convert mm/frame_vector.c.
With this, along with maybe 20 or 30 other recent patches in various
trees, we are close to having the relevant gup call sites
converted--with the notable exception of the bio/block layer.
This patch (of 2):
Introduce pin_user_pages_locked(), which is nearly identical to
get_user_pages_locked() except that it sets FOLL_PIN and rejects
FOLL_GET.
As with other pairs of get_user_pages*() and pin_user_pages*() API
calls, it's prudent to assert that FOLL_PIN is *not* set in the
get_user_pages*() call, so add that as part of this.
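A small userspace model may help show the flag rule described above: the pin_user_pages*() side rejects FOLL_GET and adds FOLL_PIN, while the get_user_pages*() side rejects FOLL_PIN. This is a sketch under stated assumptions, not kernel code: the MODEL_FOLL_* values are placeholders rather than the real constants from include/linux/mm.h, and the helper names are hypothetical. As far as we understand, the kernel's misuse paths also warn once via WARN_ON_ONCE(), which the model omits.

```c
#include <errno.h>

/*
 * Placeholder flag values for this userspace model only; the real
 * FOLL_* values live in include/linux/mm.h and may differ.
 */
#define MODEL_FOLL_WRITE 0x01u
#define MODEL_FOLL_GET   0x04u
#define MODEL_FOLL_PIN   0x40000u

/* get_user_pages*() side: callers must never request FOLL_PIN. */
static int model_check_get_flags(unsigned int gup_flags)
{
	if (gup_flags & MODEL_FOLL_PIN)
		return -EINVAL;
	return 0;
}

/*
 * pin_user_pages*() side: reject FOLL_GET, then add FOLL_PIN.
 * Returns the effective flags, or -EINVAL on misuse.
 */
static long model_pin_flags(unsigned int gup_flags)
{
	if (gup_flags & MODEL_FOLL_GET)
		return -EINVAL;
	return (long)(gup_flags | MODEL_FOLL_PIN);
}
```

For example, model_pin_flags(MODEL_FOLL_WRITE) yields MODEL_FOLL_WRITE | MODEL_FOLL_PIN, while model_pin_flags(MODEL_FOLL_GET) yields -EINVAL.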
mm/gup.c: convert to use get_user_{page|pages}_fast_only()
The API __get_user_pages_fast() is renamed to get_user_pages_fast_only()
to align with pin_user_pages_fast_only().
As part of this we will get rid of the write parameter. Instead, the
caller will pass FOLL_WRITE to get_user_pages_fast_only(). This will not
change any existing functionality of the API.
All the callers are changed to pass FOLL_WRITE.
Also introduce get_user_page_fast_only(), and use it in a few places
that hard-code nr_pages to 1.
Updated the documentation of the API.
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Paul Mackerras <paulus@ozlabs.org> [arch/powerpc/kvm]
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Michal Suchanek <msuchanek@suse.de>
Link: http://lkml.kernel.org/r/1590396812-31277-1-git-send-email-jrdr.linux@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael Aquini [Mon, 8 Jun 2020 04:40:51 +0000 (21:40 -0700)]
kernel/sysctl.c: ignore out-of-range taint bits introduced via kernel.tainted
Users with the CAP_SYS_ADMIN capability can add arbitrary taint flags to
the running kernel by writing to /proc/sys/kernel/tainted or issuing the
command 'sysctl -w kernel.tainted=...'. This interface, however, accepts
any integer value, which might cause an invalid set of flags to be
committed to the tainted_mask bitset.
This patch introduces a simple way for proc_taint() to ignore any
invalid bits coming from the user input before committing those bits
to the kernel tainted_mask.
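The filtering can be sketched as a simple mask over the defined taint bits. This is a hedged userspace illustration, assuming taint flags occupy bits 0 to TAINT_FLAGS_COUNT - 1; MODEL_TAINT_FLAGS_COUNT and filter_taint() are hypothetical names, and the value 18 reflects kernels of roughly this era but is an assumption here. The patch itself may loop over the valid bits rather than masking; the effect on the committed bits is the same.

```c
/*
 * Hypothetical stand-in for the kernel's TAINT_FLAGS_COUNT; the real
 * value depends on the kernel version.
 */
#define MODEL_TAINT_FLAGS_COUNT 18u

/* Keep only the bits that correspond to defined taint flags. */
static unsigned long filter_taint(unsigned long user_val)
{
	unsigned long valid_mask = (1UL << MODEL_TAINT_FLAGS_COUNT) - 1;

	return user_val & valid_mask;
}
```

With these assumptions, a write of 0x100020 keeps only 0x20 (bit 5), because bit 20 lies outside the defined range.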
Signed-off-by: Rafael Aquini <aquini@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Link: http://lkml.kernel.org/r/20200512223946.888020-1-aquini@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
panic: add sysctl to dump all CPUs backtraces on oops event
Usually when the kernel reaches an oops condition, it's a point of no
return; in case not enough debug information is available in the kernel
splat, one of the last resorts would be to collect a kernel crash dump
and analyze it. The problem with this approach is that in order to
collect the dump, a panic is required (to kexec-load the crash kernel).
In an environment with multiple virtual machines, users may prefer to
try living with the oops, at least until they are able to properly shut
down their VMs / finish their important tasks.
This patch implements a way to collect a bit more debug detail when an
oops event is reached, by printing all CPUs' backtraces through the
use of NMIs (on architectures that support that). The sysctl added
(and documented) here is called "oops_all_cpu_backtrace", and when set
it will (as the name suggests) dump all CPUs' backtraces.
Far from ideal, this may nevertheless be the last option for users who
for some reason cannot panic on oops. Most of the time oopses are clear
enough to indicate the kernel portion that must be investigated, but in
virtual environments it's possible to observe hypervisor/KVM issues that
could lead to oopses shown in other guests' CPUs (like virtual APIC
crashes). This patch hence aims to help debug such complex issues
without resorting to kdump.
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Matthew Wilcox <willy@infradead.org>
Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected
Commit 8d8429b66d43 ("kernel/hung_task.c: show all hung tasks before
panic") introduced a change in which we started to show all CPUs'
backtraces when a hung task is detected _and_ the sysctl/kernel
parameter "hung_task_panic" is set. The idea is good, because usually
when observing deadlocks (that may lead to hung tasks), the culprit is
another task holding a lock and not necessarily the task detected as
hung.
The problem with this approach is that dumping backtraces is a somewhat
expensive task, especially printing them on the console (and especially
on many-CPU machines, such as the servers commonly found nowadays). So,
users who plan to collect a kdump to investigate the hung tasks and
narrow down the deadlock definitely don't need the CPUs' backtraces on
dmesg/console, which will delay the panic and pollute the log (the crash
tool would easily grab all CPUs' traces with the 'bt -a' command).
Also, there's the reciprocal scenario: some users may be interested in
seeing the CPUs' backtraces but not in having the system panic when a
hung task is detected. The current approach is therefore almost like
embedding a policy in the kernel, by forcing the CPUs' backtraces dump
(only) on hung_task_panic.
This patch decouples the panic event on hung task from the CPUs'
backtraces dump, by creating (and documenting) a new sysctl called
"hung_task_all_cpu_backtrace", analogous to the approach taken for
soft/hard lockups, which have both a panic and an "all_cpu_backtrace"
sysctl to allow individual control. The new mechanism for dumping the
CPUs' backtraces on hung task detection respects "hung_task_warnings" by
not dumping the traces in case there are no warnings left.
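The interaction between the new sysctl and "hung_task_warnings" can be modelled as a small decision function. This is a sketch only, assuming a warnings value of -1 means unlimited (as documented for hung_task_warnings) and that each reported hung task consumes one warning; struct hung_task_cfg and want_all_cpu_bt() are hypothetical names, not the kernel's.

```c
#include <stdbool.h>

/* Hypothetical model of the two sysctls; -1 means "unlimited warnings". */
struct hung_task_cfg {
	int warnings;			/* models hung_task_warnings */
	bool all_cpu_backtrace;		/* models hung_task_all_cpu_backtrace */
};

/*
 * Called once per detected hung task. Returns true if all-CPU
 * backtraces should be dumped; consumes one warning when reporting.
 */
static bool want_all_cpu_bt(struct hung_task_cfg *cfg)
{
	if (cfg->warnings == 0)
		return false;		/* no warnings left: dump nothing */
	if (cfg->warnings > 0)
		cfg->warnings--;	/* -1 (unlimited) is never decremented */
	return cfg->all_cpu_backtrace;
}
```

The design point the model captures is that the backtrace dump is gated by the same warning budget as the hung task report itself, so it can never outlive the reports it accompanies.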
kernel/watchdog.c: convert {soft/hard}lockup boot parameters to sysctl aliases
After a recent change introduced by Vlastimil's series [0], the kernel is
now able to handle sysctl parameters on the kernel command line; the
series also introduced a simple infrastructure to convert legacy boot
parameters (that duplicate sysctls) into sysctl aliases.
This patch converts the watchdog parameters softlockup_panic and
{hard,soft}lockup_all_cpu_backtrace to use the new alias infrastructure.
It fixes the documentation too, since the alias only accepts values 0 or
1, not the full range of integers.
We also took the opportunity here to improve the documentation of the
previously converted hung_task_panic (see the patch series [0]) and put
the alias table in alphabetical order.
Vlastimil Babka [Mon, 8 Jun 2020 04:40:38 +0000 (21:40 -0700)]
lib/test_sysctl: support testing of sysctl. boot parameter
Testing is done by a new parameter debug.test_sysctl.boot_int which
defaults to 0 and it's expected that the tester passes a boot parameter
that sets it to 1. The test checks if it's set to 1.
To distinguish true failure from parameter not being set, the test
checks /proc/cmdline for the expected parameter, and whether test_sysctl
is built-in and not a module.
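Distinguishing a true failure from a missing parameter boils down to checking the command line for an exact token. A minimal C sketch, assuming the tester boots with sysctl.debug.test_sysctl.boot_int=1 (the sysctl. prefix form introduced by this series); cmdline_has_param() is a hypothetical helper, and the real check lives in the selftest shell script rather than in C.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the whitespace-separated command line contains the
 * exact token `param`. Simplified: does not handle quoted arguments.
 */
static bool cmdline_has_param(const char *cmdline, const char *param)
{
	size_t len = strlen(param);
	const char *p = cmdline;

	if (len == 0)
		return false;
	while ((p = strstr(p, param)) != NULL) {
		bool starts = (p == cmdline) || p[-1] == ' ';
		bool ends = p[len] == '\0' || p[len] == ' ' || p[len] == '\n';

		if (starts && ends)
			return true;
		p++;	/* partial match such as "=10": keep searching */
	}
	return false;
}
```

Matching whole tokens matters here: a plain substring search would accept boot_int=10 as if it were boot_int=1.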
[vbabka@suse.cz: skip the new test if boot_int sysctl is not present]
  Link: http://lkml.kernel.org/r/305af605-1e60-cf84-fada-6ce1ca37c102@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-6-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Mon, 8 Jun 2020 04:40:35 +0000 (21:40 -0700)]
tools/testing/selftests/sysctl/sysctl.sh: support CONFIG_TEST_SYSCTL=y
The testing script recommends CONFIG_TEST_SYSCTL=y, but actually only
works with CONFIG_TEST_SYSCTL=m. Testing of sysctl setting via boot
param however requires the test to be built-in, so make sure the test
script supports it.
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Luis Chamberlain <mcgrof@kernel.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-5-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Mon, 8 Jun 2020 04:40:31 +0000 (21:40 -0700)]
kernel/hung_task convert hung_task_panic boot parameter to sysctl
We can now handle sysctl parameters on kernel command line and have
infrastructure to convert legacy command line options that duplicate
sysctl to become a sysctl alias.
This patch converts the hung_task_panic parameter. Note that the sysctl
handler is stricter and allows only 0 and 1, while the legacy
parameter allowed any non-zero value. But there is little reason anyone
would not be using 1.
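The stricter behaviour amounts to a range check after parsing. The sketch below approximates what a min/max-bounded integer handler such as proc_dointvec_minmax() does with bounds 0 and 1; it is a userspace model, and parse_bool_sysctl() is a hypothetical name.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Parse a boolean-like sysctl value the way a 0..1 min/max handler
 * would: any integer outside [0, 1] is rejected with -EINVAL.
 */
static int parse_bool_sysctl(const char *s, int *out)
{
	char *end;
	long v;

	errno = 0;
	v = strtol(s, &end, 10);
	if (errno || end == s || (*end != '\0' && *end != '\n'))
		return -EINVAL;		/* not a plain integer */
	if (v < 0 || v > 1)
		return -EINVAL;		/* the legacy parameter accepted this */
	*out = (int)v;
	return 0;
}
```

Under this model, a legacy setting such as hung_task_panic=2 would now be rejected with -EINVAL instead of being treated as enabled.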
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-4-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Mon, 8 Jun 2020 04:40:27 +0000 (21:40 -0700)]
kernel/sysctl: support handling command line aliases
We can now handle sysctl parameters on kernel command line, but
historically some parameters introduced their own command line
equivalent, which we don't want to remove for compatibility reasons.
We can, however, convert them to the generic infrastructure with a table
translating the legacy command line parameters to their sysctl names,
and removing the one-off param handlers.
This patch adds the support and makes the first conversion to
demonstrate it, on the (deprecated) numa_zonelist_order parameter.
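Conceptually, the translation is a lookup table from legacy parameter name to sysctl name. The following is an illustrative sketch only: struct param_alias and sysctl_find_alias() are names chosen here and may not match the kernel's, and the table mixes the numa_zonelist_order entry from this patch with the hung_task_panic and softlockup_panic aliases that later patches in this series add.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative alias table: legacy boot parameter -> sysctl name. */
struct param_alias {
	const char *kernel_param;
	const char *sysctl_param;
};

static const struct param_alias aliases[] = {
	{ "hung_task_panic",     "kernel.hung_task_panic" },
	{ "numa_zonelist_order", "vm.numa_zonelist_order" },
	{ "softlockup_panic",    "kernel.softlockup_panic" },
	{ NULL, NULL }		/* sentinel */
};

/* Return the sysctl name for a legacy parameter, or NULL if none. */
static const char *sysctl_find_alias(const char *param)
{
	for (const struct param_alias *a = aliases; a->kernel_param; a++)
		if (strcmp(a->kernel_param, param) == 0)
			return a->sysctl_param;
	return NULL;
}
```

A parameter with no entry, such as nmi_watchdog (which maps to two sysctls), simply gets NULL and keeps its own handler.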
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
Acked-by: Kees Cook <keescook@chromium.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: "Eric W . Biederman" <ebiederm@xmission.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "Guilherme G . Piccoli" <gpiccoli@canonical.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/20200427180433.7029-3-vbabka@suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Mon, 8 Jun 2020 04:40:24 +0000 (21:40 -0700)]
kernel/sysctl: support setting sysctl parameters from kernel command line
Patch series "support setting sysctl parameters from kernel command line", v3.
This series adds support for something that many people seem to have
always wanted but nobody has added yet: the ability to set sysctl
parameters via kernel command line options in the form of
sysctl.vm.something=1
The important part is Patch 1. The second, not so important part is an
attempt to clean up legacy one-off parameters that do the same thing as
a sysctl. I don't want to remove them completely for compatibility
reasons, but with generic sysctl support the idea is to remove the
one-off param handlers and treat the parameters as aliases for the
sysctl variants.
I have identified several parameters that mention sysctl counterparts in
Documentation/admin-guide/kernel-parameters.txt but there might be more.
The conversion also has varying levels of success:
- numa_zonelist_order is converted in Patch 2 together with adding the
necessary infrastructure. It's easy as it doesn't really do anything
but warn on deprecated value these days.
- hung_task_panic is converted in Patch 3, but there's a downside that
now it only accepts 0 and 1, while previously it was any integer
value
- nmi_watchdog maps to two sysctls nmi_watchdog and hardlockup_panic,
so there's no straightforward conversion possible
- traceoff_on_warning is a flag without value and it would be required
to handle that somehow in the conversion infrastructure, which seems
pointless for a single flag
This patch (of 5):
A recently proposed patch to add a vm_swappiness command line parameter
in addition to the existing sysctl [1] made me wonder why we don't have
general support for passing sysctl parameters via the command line.
Googling found only somebody else wondering the same [2], but I haven't
found any prior discussion with reasons why not to do this.
Setting the vm_swappiness issue aside (the underlying issue might be
solved in a different way), a quick search of kernel-parameters.txt
shows there are already some that exist as both sysctl and kernel
parameter: hung_task_panic, nmi_watchdog, numa_zonelist_order,
traceoff_on_warning.
A general mechanism would remove the need to add more of those one-offs
and might be handy in situations where configuration by e.g.
/etc/sysctl.d/ is impractical.
Hence, this patch adds a new parse_args() pass that looks for parameters
prefixed by 'sysctl.' and tries to interpret them as writes to the
corresponding sys/ files using a temporary in-kernel procfs mount.
This mechanism was suggested by Eric W. Biederman [3], as it handles
all dynamically registered sysctl tables, even though we don't handle
modular sysctls. Errors due to e.g. invalid parameter name or value
are reported in the kernel log.
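The name mapping step can be sketched as follows, assuming parse_args() has already split "name=value" and handed over only the name: strip the 'sysctl.' prefix and turn the remaining dots into slashes to get a path below /proc/sys/. sysctl_param_to_path() is a hypothetical helper, and the real parser may accept additional separators that this simplified version does not.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Map a boot parameter name such as "sysctl.vm.swappiness" to the
 * path relative to /proc/sys/ ("vm/swappiness"). Returns false if the
 * name lacks the "sysctl." prefix, is empty after it, or does not fit.
 */
static bool sysctl_param_to_path(const char *param, char *buf, size_t size)
{
	static const char prefix[] = "sysctl.";
	size_t plen = sizeof(prefix) - 1;
	size_t i, n;

	if (strncmp(param, prefix, plen) != 0)
		return false;
	param += plen;
	n = strlen(param);
	if (n == 0 || n >= size)
		return false;
	for (i = 0; i < n; i++)
		buf[i] = (param[i] == '.') ? '/' : param[i];
	buf[n] = '\0';
	return true;
}
```

Parameters without the prefix are left alone, so this pass can run over the whole command line without disturbing ordinary boot options.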
The processing is hooked right before the init process is loaded, as
some handlers might be more complicated than simple setters and might
need some subsystems to be initialized. At the point where the init
process can be started and eventually execute a process writing to
/proc/sys/, it should also be fine to do the same from the kernel.
Sysctls registered later, at module load time, are not set by this
mechanism; it's expected that in such scenarios, setting sysctl values
from userspace is practical enough.
Rafael Aquini [Mon, 8 Jun 2020 04:40:17 +0000 (21:40 -0700)]
kernel: add panic_on_taint
Analogously to the introduction of panic_on_warn, this patch introduces
a kernel option named panic_on_taint in order to provide a simple and
generic way to stop execution and catch a coredump when the kernel gets
tainted by any given flag.
This is useful for debugging sessions as it avoids having to rebuild the
kernel to explicitly add calls to panic() into the code sites that
introduce the taint flags of interest.
For instance, if one is interested in proceeding with a post-mortem
analysis at the point a given code path is hitting a bad page (e.g.
unaccount_page_cache_page() or slab_bug()), a coredump can be collected
by rebooting the kernel with 'panic_on_taint=0x20' (bit 5, TAINT_BAD_PAGE)
appended to the command line.
Another, perhaps less frequent, use for this option would be as a means
of enforcing a security policy where only a subset of taints, or no
taint at all (in paranoid mode), is allowed for the running system. The
optional switch 'nousertaint' is handy in this particular scenario, as
it prevents userspace-induced crashes, where writes to the sysctl
interface /proc/sys/kernel/tainted would cause false-positive hits for
such policies.
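The two behaviours can be modelled together as below. This is a hedged userspace sketch, assuming panic_on_taint is a bitmask indexed by taint flag number and that 'nousertaint' makes writes to kernel.tainted that overlap that mask fail with -EINVAL; struct taint_policy and both helpers are hypothetical names. Bit 5 corresponds to the bad-page taint, hence the 0x20 in the example above.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_TAINT_BAD_PAGE 5u	/* bit 5, i.e. panic_on_taint=0x20 */

struct taint_policy {
	unsigned long panic_mask;	/* models panic_on_taint */
	bool nousertaint;		/* models the 'nousertaint' switch */
};

/* Kernel-side taint: panic if the flag's bit is in the mask. */
static bool taint_should_panic(const struct taint_policy *p, unsigned int flag)
{
	return (p->panic_mask & (1UL << flag)) != 0;
}

/*
 * Userspace write to kernel.tainted: returns 0 if the write would be
 * accepted, or -EINVAL if nousertaint blocks a bit in the panic mask.
 */
static int user_taint_write(const struct taint_policy *p, unsigned long val)
{
	if (p->nousertaint && (val & p->panic_mask))
		return -EINVAL;
	return 0;
}
```

With panic_on_taint=0x20 and nousertaint set, a genuine bad-page taint still panics, while an administrator writing 0x20 to kernel.tainted gets an error instead of a crash.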
Orson Zhai [Mon, 8 Jun 2020 04:40:14 +0000 (21:40 -0700)]
dynamic_debug: add an option to enable dynamic debug for modules only
Instead of enabling dynamic debug globally with CONFIG_DYNAMIC_DEBUG,
CONFIG_DYNAMIC_DEBUG_CORE enables only the core functions of dynamic
debug. Dynamic debug is then tied to any module that is built with
DYNAMIC_DEBUG_MODULE defined.
This is useful for people who only want to enable dynamic debug for
kernel modules without worrying about kernel image size and memory
consumption increasing too much.
[orson.zhai@unisoc.com: v2]
  Link: http://lkml.kernel.org/r/1587408228-10861-1-git-send-email-orson.unisoc@gmail.com
Signed-off-by: Orson Zhai <orson.zhai@unisoc.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Acked-by: Petr Mladek <pmladek@suse.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Jason Baron <jbaron@akamai.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Link: http://lkml.kernel.org/r/1586521984-5890-1-git-send-email-orson.unisoc@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
for (i = 0; i < 1000; i++)
if (unshare(CLONE_NEWIPC) < 0)
error(EXIT_FAILURE, errno, "unshare");
}
goes from
Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.06
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.05
to
Command being timed: "./ipc-namespace"
User time (seconds): 0.00
System time (seconds): 0.02
Percent of CPU this job got: 96%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.03
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Link: http://lkml.kernel.org/r/20200225145419.527994-1-gscrivan@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
SeongJae Park [Mon, 8 Jun 2020 04:40:04 +0000 (21:40 -0700)]
mm/page_idle.c: skip offline pages
'Idle page tracking' users can pass an arbitrary pfn that might map to
an offline page. To avoid accessing such pages, this commit modifies
'page_idle_get_page()' to use 'pfn_to_online_page()' instead of the
'pfn_valid()' and 'pfn_to_page()' combination, so that a pfn mapped to
an offline page is skipped.
Reported-by: David Hildenbrand <david@redhat.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Link: http://lkml.kernel.org/r/20200605092502.18018-2-sjpark@amazon.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 7 Jun 2020 17:59:32 +0000 (10:59 -0700)]
Merge tag 'char-misc-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver updates from Greg KH:
"Here is the large set of char/misc driver patches for 5.8-rc1
Included in here are:
- habanalabs driver updates, loads
- mhi bus driver updates
- extcon driver updates
- clk driver updates (approved by the clock maintainer)
- firmware driver updates
- fpga driver updates
- gnss driver updates
- coresight driver updates
- interconnect driver updates
- parport driver updates (it's still alive!)
- nvmem driver updates
- soundwire driver updates
- visorbus driver updates
- w1 driver updates
- various misc driver updates
In short, loads of different driver subsystem updates along with the
drivers as well.
All have been in linux-next for a while with no reported issues"
* tag 'char-misc-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (233 commits)
habanalabs: correctly cast u64 to void*
habanalabs: initialize variable to default value
extcon: arizona: Fix runtime PM imbalance on error
extcon: max14577: Add proper dt-compatible strings
extcon: adc-jack: Fix an error handling path in 'adc_jack_probe()'
extcon: remove redundant assignment to variable idx
w1: omap-hdq: print dev_err if irq flags are not cleared
w1: omap-hdq: fix interrupt handling which did show spurious timeouts
w1: omap-hdq: fix return value to be -1 if there is a timeout
w1: omap-hdq: cleanup to add missing newline for some dev_dbg
/dev/mem: Revoke mappings when a driver claims the region
misc: xilinx-sdfec: convert get_user_pages() --> pin_user_pages()
misc: xilinx-sdfec: cleanup return value in xsdfec_table_write()
misc: xilinx-sdfec: improve get_user_pages_fast() error handling
nvmem: qfprom: remove incorrect write support
habanalabs: handle MMU cache invalidation timeout
habanalabs: don't allow hard reset with open processes
habanalabs: GAUDI does not support soft-reset
habanalabs: add print for soft reset due to event
habanalabs: improve MMU cache invalidation code
...
Linus Torvalds [Sun, 7 Jun 2020 17:53:36 +0000 (10:53 -0700)]
Merge tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the set of driver core patches for 5.8-rc1.
Not all that huge this release, just a number of small fixes and
updates:
- software node fixes
- kobject now sends KOBJ_REMOVE when it is removed from sysfs, not
when it is removed from memory (which could come much later)
- device link additions and fixes based on testing on more devices
- firmware core cleanups
- other minor changes, full details in the shortlog
All have been in linux-next for a while with no reported issues"
* tag 'driver-core-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (23 commits)
driver core: Update device link status correctly for SYNC_STATE_ONLY links
firmware_loader: change enum fw_opt to u32
software node: implement software_node_unregister()
kobject: send KOBJ_REMOVE uevent when the object is removed from sysfs
driver core: Remove unnecessary is_fwnode_dev variable in device_add()
drivers property: When no children in primary, try secondary
driver core: platform: Fix spelling errors in platform.c
driver core: Remove check in driver_deferred_probe_force_trigger()
of: platform: Batch fwnode parsing when adding all top level devices
driver core: fw_devlink: Add support for batching fwnode parsing
driver core: Look for waiting consumers only for a fwnode's primary device
driver core: Move code to the right part of the file
Revert "Revert "driver core: Set fw_devlink to "permissive" behavior by default""
drivers: base: Fix NULL pointer exception in __platform_driver_probe() if a driver developer is foolish
firmware_loader: move fw_fallback_config to a private kernel symbol namespace
driver core: Add missing '\n' in log messages
driver/base/soc: Use kobj_to_dev() API
Add documentation on meaning of -EPROBE_DEFER
driver core: platform: remove redundant assignment to variable ret
debugfs: Use the correct style for SPDX License Identifier
...
Linus Torvalds [Sun, 7 Jun 2020 17:45:08 +0000 (10:45 -0700)]
Merge tag 'staging-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO driver updates from Greg KH:
"Here is the large set of staging and IIO driver changes for 5.8-rc1
Nothing major, but a lot of new IIO drivers are included in here,
along with other core iio cleanups and changes.
On the staging driver front, again, nothing noticeable. No new
deletions or additions, just a ton of tiny cleanups all over the tree
done by a lot of different people. Most coding style, but many actual
real fixes and cleanups that are nice to see.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (618 commits)
staging: rtl8723bs: Use common packet header constants
staging: sm750fb: Add names to proc_setBLANK args
staging: most: usb: init return value in default path of switch/case expression
staging: vchiq: Get rid of VCHIQ_SERVICE_OPENEND callback reason
staging: vchiq: move vchiq_release_message() into vchiq
staging: vchi: Get rid of C++ guards
staging: vchi: Get rid of not implemented function declarations
staging: vchi: Get rid of vchiq_status_to_vchi()
staging: vchi: Get rid of vchi_service_set_option()
staging: vchi: Merge vchi_msg_queue() into vchi_queue_kernel_message()
staging: vchiq: Move copy callback handling into vchiq
staging: vchi: Get rid of vchi_queue_user_message()
staging: vchi: Get rid of vchi_service_destroy()
staging: most: usb: use function sysfs_streq
staging: most: usb: add missing put_device calls
staging: most: usb: use correct error codes
staging: most: usb: replace code to calculate array index
staging: most: usb: don't use error path to exit function on success
staging: most: usb: move allocation of URB out of critical section
staging: most: usb: return 0 instead of variable
...
Linus Torvalds [Sun, 7 Jun 2020 16:52:36 +0000 (09:52 -0700)]
Merge tag 'tty-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here are the tty and serial driver updates for 5.8-rc1
Nothing huge at all, just a lot of little serial driver fixes, updates
for new devices and features, and other small things. Full details are
in the shortlog.
All of these have been in linux-next with no issues for a while"
* tag 'tty-5.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (67 commits)
tty: serial: qcom_geni_serial: Add 51.2MHz frequency support
tty: serial: imx: clear Ageing Timer Interrupt in handler
serial: 8250_fintek: Add F81966 Support
sc16is7xx: Add flag to activate IrDA mode
dt-bindings: sc16is7xx: Add flag to activate IrDA mode
serial: 8250: Support rs485 bus termination GPIO
serial: 8520_port: Fix function param documentation
dt-bindings: serial: Add binding for rs485 bus termination GPIO
vt: keyboard: avoid signed integer overflow in k_ascii
serial: 8250: Enable 16550A variants by default on non-x86
tty: hvc_console, fix crashes on parallel open/close
serial: imx: Initialize lock for non-registered console
sc16is7xx: Read the LSR register for basic device presence check
sc16is7xx: Allow sharing the IRQ line
sc16is7xx: Use threaded IRQ
sc16is7xx: Always use falling edge IRQ
tty: n_gsm: Fix bogus i++ in gsm_data_kick
tty: n_gsm: Remove unnecessary test in gsm_print_packet()
serial: stm32: add no_console_suspend support
tty: serial: fsl_lpuart: Use __maybe_unused instead of #if CONFIG_PM_SLEEP
...
Linus Torvalds [Sat, 6 Jun 2020 22:22:01 +0000 (15:22 -0700)]
Merge tag 'sh-for-5.8' of git://git.libc.org/linux-sh
Pull arch/sh updates from Rich Felker:
"Fix for arch/sh build regression with newer binutils, removal of SH5,
fixes for module exports, and misc cleanup"
* tag 'sh-for-5.8' of git://git.libc.org/linux-sh:
sh: remove sh5 support
sh: add missing EXPORT_SYMBOL() for __delay
sh: Convert ins[bwl]/outs[bwl] macros to inline functions
sh: Convert iounmap() macros to inline functions
sh: Add missing DECLARE_EXPORT() for __ashiftrt_r4_xx
sh: configs: Cleanup old Kconfig IO scheduler options
arch/sh: vmlinux.scr
sh: Replace CONFIG_MTD_M25P80 with CONFIG_MTD_SPI_NOR in sh7757lcr_defconfig
sh: sh4a: Bring back tmu3_device early device
Linus Torvalds [Sat, 6 Jun 2020 19:07:28 +0000 (12:07 -0700)]
Merge tag 'kconfig-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig updates from Masahiro Yamada:
- allow only 'config', 'comment', 'if' statements inside 'choice' since
the other statements are not sensible inside 'choice' and should be
treated as grammatical errors
- support LMC_KEEP env variable for 'make local{yes,mod}config' to
preserve some CONFIG options
- deprecate 'make kvmconfig' and 'make xenconfig' in favor of
'make kvm_guest.config' and 'make xen.config'
- code cleanups
* tag 'kconfig-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
kconfig: announce removal of 'kvmconfig' and 'xenconfig' shorthands
streamline_config.pl: add LMC_KEEP to preserve some kconfigs
kconfig: allow only 'config', 'comment', and 'if' inside 'choice'
kconfig: tests: remove randconfig test for choice in choice
kconfig: do not assign a variable in the return statement
kconfig: do not use OR-assignment for zero-cleared structure
Linus Torvalds [Sat, 6 Jun 2020 19:00:25 +0000 (12:00 -0700)]
Merge tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kbuild updates from Masahiro Yamada:
- fix warnings in 'make clean' for ARCH=um, hexagon, h8300, unicore32
- ensure that all objects are rebuilt when the compiler is upgraded
- exclude system headers from dependency tracking and fixdep processing
- fix potential bit-size mismatch between the kernel and BPF user-mode
helper
- add the new syntax 'userprogs' to build user-space programs for the
target architecture (the same arch as the kernel)
- compile user-space sample code under samples/ for the target arch
instead of the host arch
- make headers_install fail if a CONFIG option is leaked to user-space
- sanitize the output format of scripts/checkstack.pl
- handle ARM 'push' instruction in scripts/checkstack.pl
- error out before modpost if a module name conflict is found
- error out when multiple directories are passed to M= because this
feature has been broken for a long time
- add CONFIG_DEBUG_INFO_COMPRESSED to support compressed debug info
- a lot of cleanups of modpost
- dump vmlinux symbols out into vmlinux.symvers, and reuse it in the
second pass of modpost
- do not run the second pass of modpost if nothing in modules is
updated
- install modules.builtin(.modinfo) by 'make install' as well as by
'make modules_install' because it is useful even when
CONFIG_MODULES=n
- add new command line variables, GZIP, BZIP2, LZOP, LZMA, LZ4, and XZ
to allow users to use alternatives such as pigz, pbzip2, etc.
* tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (96 commits)
kbuild: add variables for compression tools
Makefile: install modules.builtin even if CONFIG_MODULES=n
mksysmap: Fix the mismatch of '.L' symbols in System.map
kbuild: doc: rename LDFLAGS to KBUILD_LDFLAGS
modpost: change elf_info->size to size_t
modpost: remove is_vmlinux() helper
modpost: strip .o from modname before calling new_module()
modpost: set have_vmlinux in new_module()
modpost: remove mod->skip struct member
modpost: add mod->is_vmlinux struct member
modpost: remove is_vmlinux() call in check_for_{gpl_usage,unused}()
modpost: remove mod->is_dot_o struct member
modpost: move -d option in scripts/Makefile.modpost
modpost: remove -s option
modpost: remove get_next_text() and make {grab,release_}file static
modpost: use read_text_file() and get_line() for reading text files
modpost: avoid false-positive file open error
modpost: fix potential mmap'ed file overrun in get_src_version()
modpost: add read_text_file() and get_line() helpers
modpost: do not call get_modinfo() for vmlinux(.o)
...
Linus Torvalds [Sat, 6 Jun 2020 18:55:53 +0000 (11:55 -0700)]
Merge tag 'dma-mapping-5.8-2' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping helpers from Christoph Hellwig:
"These were in a separate stable branch so that various media and drm
trees could pull them in for bug fixes, but looking at linux-next that
hasn't actually happened yet. Still sending the APIs to you in the
hope that these bug fixes get picked up for 5.8 in one way or another.
Summary:
- add DMA mapping helpers for struct sg_table (Marek Szyprowski)"
* tag 'dma-mapping-5.8-2' of git://git.infradead.org/users/hch/dma-mapping:
iommu: add generic helper for mapping sgtable objects
scatterlist: add generic wrappers for iterating over sgtable objects
dma-mapping: add generic helpers for mapping sgtable objects
Linus Torvalds [Sat, 6 Jun 2020 18:43:23 +0000 (11:43 -0700)]
Merge tag 'dma-mapping-5.8' of git://git.infradead.org/users/hch/dma-mapping
Pull dma-mapping updates from Christoph Hellwig:
- enhance the dma pool to allow atomic allocation on x86 with AMD SEV
(David Rientjes)
- two small cleanups (Jason Yan and Peter Collingbourne)
* tag 'dma-mapping-5.8' of git://git.infradead.org/users/hch/dma-mapping:
dma-contiguous: fix comment for dma_release_from_contiguous
dma-pool: scale the default DMA coherent pool size with memory capacity
x86/mm: unencrypted non-blocking DMA allocations use coherent pools
dma-pool: add pool sizes to debugfs
dma-direct: atomic allocations must come from atomic coherent pools
dma-pool: dynamically expanding atomic pools
dma-pool: add additional coherent pools to map to gfp mask
dma-remap: separate DMA atomic pools from direct remap code
dma-debug: make __dma_entry_alloc_check_leak() static
- Work around Intel PCH MROMs that have invalid BARs (Xiaochun Lee)"
* tag 'pci-v5.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (100 commits)
PCI: uniphier: Add Socionext UniPhier Pro5 PCIe endpoint controller driver
PCI: Add ACS quirk for Intel Root Complex Integrated Endpoints
PCI/DPC: Print IRQ number used by port
PCI/AER: Use "aer" variable for capability offset
PCI/AER: Remove redundant dev->aer_cap checks
PCI/AER: Remove redundant pci_is_pcie() checks
PCI/AER: Remove HEST/FIRMWARE_FIRST parsing for AER ownership
PCI: tegra: Fix runtime PM imbalance on error
PCI: vmd: Filter resource type bits from shadow register
PCI: tegra194: Fix runtime PM imbalance on error
dt-bindings: PCI: Add UniPhier PCIe endpoint controller description
PCI: hv: Use struct_size() helper
PCI: Rename _DSM constants to align with spec
PCI: Avoid FLR for AMD Starship USB 3.0
PCI: Avoid FLR for AMD Matisse HD Audio & USB 3.0
x86/PCI: Drop unused xen_register_pirq() gsi_override parameter
PCI: dwc: Use private data pointer of "struct irq_domain" to get pcie_port
PCI: amlogic: meson: Don't use FAST_LINK_MODE to set up link
PCI: dwc: Fix inner MSI IRQ domain registration
PCI: dwc: pci-dra7xx: Use devm_platform_ioremap_resource_byname()
...
Linus Torvalds [Sat, 6 Jun 2020 17:01:48 +0000 (10:01 -0700)]
Merge branch 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq
Pull workqueue updates from Tejun Heo:
"Mostly cleanups and other trivial changes.
The only interesting change is Sebastian's rcuwait conversion for RT"
* 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
workqueue: use BUILD_BUG_ON() for compile time test instead of WARN_ON()
workqueue: fix a piece of comment about reserved bits for work flags
workqueue: remove useless unlock() and lock() in series
workqueue: void unneeded requeuing the pwq in rescuer thread
workqueue: Convert the pool::lock and wq_mayday_lock to raw_spinlock_t
workqueue: Use rcuwait for wq_manager_wait
workqueue: Remove unnecessary kfree() call in rcu_free_wq()
workqueue: Fix an use after free in init_rescuer()
workqueue: Use IS_ERR and PTR_ERR instead of PTR_ERR_OR_ZERO.
Linus Torvalds [Sat, 6 Jun 2020 16:59:34 +0000 (09:59 -0700)]
Merge branch 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Just two patches: one to add system-level cpu.stat to the root cgroup
for convenience and a trivial comment update"
* 'for-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: add cpu.stat file to root cgroup
cgroup: Remove stale comments
Linus Torvalds [Sat, 6 Jun 2020 16:39:05 +0000 (09:39 -0700)]
Merge tag 'integrity-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity
Pull integrity updates from Mimi Zohar:
"The main changes are extending the TPM 2.0 PCR banks with bank
specific file hashes, calculating the "boot_aggregate" based on other
TPM PCR banks, using the default IMA hash algorithm, instead of SHA1,
as the basis for the cache hash table key, and preventing the mprotect
syscall from circumventing an IMA mmap appraise policy rule.
- In preparation for extending TPM 2.0 PCR banks with bank specific
digests, commit 9922ccf999c5 ("tpm: pass an array of
tpm_extend_digest structures to tpm_pcr_extend()") modified
tpm_pcr_extend(). The original SHA1 file digests were
padded/truncated, before being extended into the other TPM PCR
banks. This pull request calculates and extends the TPM PCR banks
with bank specific file hashes completing the above change.
- The "boot_aggregate", the first IMA measurement list record, is the
"trusted boot" link between the pre-boot environment and the
running OS. With TPM 2.0, the "boot_aggregate" record is not
limited to being based on the SHA1 TPM PCR bank, but can be
calculated based on any enabled bank, assuming the hash algorithm
is also enabled in the kernel.
Other changes include the following and five other bug fixes/code
clean up:
- supporting both a SHA1 and a larger "boot_aggregate" digest in a
custom template format containing both the SHA1 ('d') and
larger digests ('d-ng') fields.
- Initial hash table key fix, but additional changes would be good"
* tag 'integrity-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity:
ima: Directly free *entry in ima_alloc_init_template() if digests is NULL
ima: Call ima_calc_boot_aggregate() in ima_eventdigest_init()
ima: Directly assign the ima_default_policy pointer to ima_rules
ima: verify mprotect change is consistent with mmap policy
evm: Fix possible memory leak in evm_calc_hmac_or_hash()
ima: Set again build_ima_appraise variable
ima: Remove redundant policy rule set in add_rules()
ima: Fix ima digest hash table key calculation
ima: Use ima_hash_algo for collision detection in the measurement list
ima: Calculate and extend PCR with digests in ima_template_entry
ima: Allocate and initialize tfm for each PCR bank
ima: Switch to dynamically allocated buffer for template digests
ima: Store template digest directly in ima_template_entry
ima: Evaluate error in init_ima()
ima: Switch to ima_hash_algo for boot aggregate
Denis Efremov [Fri, 5 Jun 2020 07:39:55 +0000 (10:39 +0300)]
kbuild: add variables for compression tools
Allow user to use alternative implementations of compression tools,
such as pigz, pbzip2, pxz. For example, multi-threaded tools to
speed up the build:
$ make GZIP=pigz BZIP2=pbzip2
Variables _GZIP, _BZIP2, _LZOP are used internally because original env
vars are reserved by the tools. The use of the GZIP variable by the gzip
tool has been obsolete since 2015. However, alternative implementations
(e.g., pigz) still rely on it. BZIP2, BZIP, LZOP vars are not obsolescent.
The credit goes to @grsecurity.
As a sidenote, for multi-threaded lzma, xz compression one can use:
$ export XZ_OPT="--threads=0"
ashimida [Tue, 2 Jun 2020 07:45:17 +0000 (15:45 +0800)]
mksysmap: Fix the mismatch of '.L' symbols in System.map
When System.map was generated, the kernel used mksysmap to
filter the kernel symbols, but all the symbols with the
second letter 'L' in the kernel were filtered out, not just
the symbols starting with 'dot + L'.
For example:
ashimida@ubuntu:~/linux$ cat System.map |grep ' .L'
ashimida@ubuntu:~/linux$ nm -n vmlinux |grep ' .L'
ffff0000088028e0 t bLength_show
......
ffff0000092e0408 b PLLP_OUTC_lock
ffff0000092e0410 b PLLP_OUTA_lock
The original intent should be to filter out all local symbols
starting with '.L', so the dot should be escaped.
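The effect of the unescaped dot can be reproduced outside the build with POSIX regular expressions. The C sketch below is only an illustration: mksysmap itself is a shell script that filters nm output with grep, and the helper name matches() is hypothetical. It compiles each pattern as a POSIX basic regular expression (the dialect grep uses by default) and tests it against sample nm lines taken from the output above.

```c
#include <assert.h>
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if 'line' contains a match for the POSIX
 * basic regular expression 'pattern' (no REG_EXTENDED, like plain grep). */
static bool matches(const char *pattern, const char *line)
{
    regex_t re;
    bool hit;

    if (regcomp(&re, pattern, REG_NOSUB) != 0)
        return false;                       /* treat a bad pattern as no match */
    hit = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);
    return hit;
}
```

With the unescaped pattern " .L", the dot matches any character, so " bL" in bLength_show and " PL" in PLLP_OUTC_lock both match. With " \.L" (written " \\.L" in a C string literal), only names that literally begin with ".L" match.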
Fixes: 8cc909fd7fca ("mksysmap: Add h8300 local symbol pattern")
Signed-off-by: ashimida <ashimida@linux.alibaba.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Masahiro Yamada [Mon, 1 Jun 2020 05:57:29 +0000 (14:57 +0900)]
modpost: strip .o from modname before calling new_module()
new_module() conditionally strips the .o because the modname has .o
suffix when it is called from read_symbols(), but no .o when it is
called from read_dump().
Masahiro Yamada [Mon, 1 Jun 2020 05:57:24 +0000 (14:57 +0900)]
modpost: remove mod->is_dot_o struct member
Previously, there were two cases where mod->is_dot_o is unset:
[1] the executable 'vmlinux' in the second pass of modpost
[2] modules loaded by read_dump()
I think [1] was intended usage to distinguish 'vmlinux.o' and 'vmlinux'.
Now that modpost does not parse the executable 'vmlinux', this case
does not happen.
[2] is obscure, maybe a bug. Module.symvers stores module paths without
extension. So, none of the modules loaded by read_dump() has the .o suffix,
and new_module() unsets ->is_dot_o. Anyway, it is not a big deal because
handle_symbol() is not called for the case.
Masahiro Yamada [Mon, 1 Jun 2020 05:57:20 +0000 (14:57 +0900)]
modpost: use read_text_file() and get_line() for reading text files
grab_file() mmaps a file, but it is not so efficient here because
get_next_line() copies every line to the temporary buffer anyway.
read_text_file() and get_line() are simpler. get_line() exploits the
library function strchr().
Going forward, a missing *.symvers or *.cmd file is a fatal error.
This should not happen because scripts/Makefile.modpost guards the
-i option files with $(wildcard $(input-symdump)).
Masahiro Yamada [Mon, 1 Jun 2020 05:57:19 +0000 (14:57 +0900)]
modpost: avoid false-positive file open error
One problem of grab_file() is that it cannot distinguish the following
two cases:
- It cannot read the file (the file does not exist, or read permission
is not set)
- It can read the file, but the file size is zero
This is because grab_file() calls mmap(), which requires the mapped
length to be greater than 0. Hence, grab_file() fails in both cases.
If an empty header file were included for checksum calculation, the
following warning would be printed:
WARNING: modpost: could not open ...: Invalid argument
An empty file is a valid source file, so it should not fail.
Use read_text_file() instead. It can read a zero-length file.
Then, parse_file() succeeds without doing anything.
Going forward, the first case (it cannot read the file) is a fatal
error. If the source file from which an object was compiled is missing,
something went wrong.
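The root cause can be demonstrated directly: on Linux, mmap() rejects a zero length with EINVAL, so any helper built on it cannot tell an empty file apart from an unreadable one. A minimal C sketch, assuming a Linux/POSIX system with a writable /tmp; the function name is hypothetical and unrelated to modpost's own code.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: create an empty temporary file and try to mmap()
 * its whole (zero) length. Returns the errno reported by mmap(), or 0
 * if the mapping unexpectedly succeeded. */
static int mmap_empty_file_errno(void)
{
    char path[] = "/tmp/modpost-emptyXXXXXX";
    int fd = mkstemp(path);             /* new, zero-length file */
    int err = 0;
    void *p;

    if (fd < 0)
        return errno;
    unlink(path);                       /* removed once fd is closed */

    p = mmap(NULL, 0, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED)
        err = errno;                    /* expected: EINVAL */
    else
        munmap(p, 0);
    close(fd);
    return err;
}
```

Reading the file with read() into a heap buffer instead, as read_text_file() does, simply yields zero bytes for an empty file, so "empty" and "cannot open" become distinguishable.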
Masahiro Yamada [Mon, 1 Jun 2020 05:57:18 +0000 (14:57 +0900)]
modpost: fix potential mmap'ed file overrun in get_src_version()
I do not know how reliably this function works, but it looks dangerous
to me.
strchr(sources, '\n');
... continues searching until it finds '\n' or it reaches the '\0'
terminator. In other words, 'sources' should be a null-terminated
string.
However, grab_file() just mmaps a file, so 'sources' is not terminated
with a null byte. If the file does not contain '\n' at all, strchr() will
go beyond the mmap'ed memory.
Use read_text_file(), which loads the file content into a malloc'ed
buffer and appends a null byte.
Here we are interested only in the first line of *.mod files. Use
get_line() helper to get the first line.
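The approach described here (load the whole file, NUL-terminate it, then split it into lines) can be sketched in portable C. This is not modpost's read_text_file()/get_line() code: the names load_text_file() and next_line() are hypothetical, and error handling is reduced to returning NULL. The key point is the terminator: once buf[size] is '\0', strchr() is bounded even if the file contains no newline.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: load a whole file into a malloc'ed, NUL-terminated
 * buffer. A zero-length file yields "" rather than an error. */
static char *load_text_file(const char *path)
{
    struct stat st;
    char *buf;
    size_t done = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) < 0 || !(buf = malloc((size_t)st.st_size + 1))) {
        close(fd);
        return NULL;
    }
    while (done < (size_t)st.st_size) {
        ssize_t n = read(fd, buf + done, (size_t)st.st_size - done);
        if (n <= 0) {                   /* read error or unexpected EOF */
            free(buf);
            close(fd);
            return NULL;
        }
        done += (size_t)n;
    }
    buf[done] = '\0';                   /* makes later strchr() calls safe */
    close(fd);
    return buf;
}

/* Hypothetical helper: cut the next line off *cursor in place.
 * Returns NULL once the buffer is exhausted. */
static char *next_line(char **cursor)
{
    char *line = *cursor, *nl;

    if (!line || *line == '\0')
        return NULL;
    nl = strchr(line, '\n');            /* stops at the '\0' at worst */
    if (nl)
        *nl++ = '\0';
    *cursor = nl;
    return line;
}
```

For the *.mod case above, a single call to next_line() on the loaded buffer would be enough, since only the first line is of interest.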
This also makes missing *.mod file a fatal error.
Commit a46901d7aefd ("kbuild: do not emit src version warning for
non-modules") ignored missing *.mod files.
I do not fully understand what that commit addressed, but commit f07d7fd4fec6 ("kbuild: introduce new option to enhance section mismatch
analysis") introduced partial section checks by using modpost. built-in.o
was parsed by modpost. Even modules had a problem because *.mod files
were created after the modpost check.
Commit 9caf5106f694 ("kbuild: create *.mod with full directory path and
remove MODVERDIR") stopped doing that. Now that modpost is only invoked
after the directory descend, *.mod files should always exist at the
modpost stage.
Masahiro Yamada [Mon, 1 Jun 2020 05:57:17 +0000 (14:57 +0900)]
modpost: add read_text_file() and get_line() helpers
modpost uses grab_file() to open a file, but it is not suitable for
a text file because the mmap'ed file is not terminated by a null byte.
Actually, I see some issues with the use of grab_file().
The new helper read_text_file() loads the whole file content into a
malloc'ed buffer and appends a null byte. Then, get_line() reads
each line.
To handle text files, I intend to replace as follows:
  grab_file()     -> read_text_file()
  get_next_line() -> get_line()
Masahiro Yamada [Mon, 1 Jun 2020 05:57:16 +0000 (14:57 +0900)]
modpost: do not call get_modinfo() for vmlinux(.o)
The three calls of get_modinfo() ("license", "import_ns", "version")
always return NULL for vmlinux(.o) because the built-in module info is
prefixed with __MODULE_INFO_PREFIX.
It is harmless to call get_modinfo(), but there is no point in searching
for what apparently does not exist.
Masahiro Yamada [Mon, 1 Jun 2020 05:57:13 +0000 (14:57 +0900)]
modpost: show warning if vmlinux is not found when processing modules
check_exports() does not print warnings about unresolved symbols if
vmlinux is missing because there would be too many.
This situation happens when you do 'make modules' from the clean
tree, or compile external modules against a kernel tree that has
not been completely built.
It is dangerous to not check unresolved symbols because you might be
building useless modules. At the very least, a warning should be shown.
Masahiro Yamada [Mon, 1 Jun 2020 05:57:11 +0000 (14:57 +0900)]
modpost: generate vmlinux.symvers and reuse it for the second modpost
The full build runs modpost twice, first for vmlinux.o and second for
modules.
The first pass dumps all the vmlinux symbols into Module.symvers, but
the second pass parses vmlinux again instead of reusing the dump file,
presumably because it needs to avoid accumulating stale symbols.
Loading symbol info from a dump file is faster than parsing an ELF object.
Besides, modpost deals with various issues to parse vmlinux in the second
pass.
A solution is to make the first pass dump symbols into a separate file,
vmlinux.symvers. The second pass reads it, and parses module .o files.
The merged symbol information is dumped into Module.symvers in the same
way as before.
This makes further modpost cleanups possible.
Also, it fixes the problem of 'make vmlinux', which previously overwrote
Module.symvers, throwing away module symbols.
I slightly touched scripts/link-vmlinux.sh so that vmlinux is re-linked
when you cross this commit. Otherwise, vmlinux.symvers would not be
generated.
Masahiro Yamada [Mon, 1 Jun 2020 05:57:05 +0000 (14:57 +0900)]
modpost: track if the symbol origin is a dump file or ELF object
The meaning of sym->kernel is obscure; it is set for in-kernel symbols
loaded from Module.symvers. This happens only when we are building
external modules, and it is used to determine whether to dump symbols
to $(KBUILD_EXTMOD)/Module.symvers.
It is clearer to remember whether the symbol or module came from a dump
file or ELF object.
This changes the KBUILD_EXTRA_SYMBOLS behavior. Previously, symbols
loaded from KBUILD_EXTRA_SYMBOLS were accumulated into the current
$(KBUILD_EXTMOD)/Module.symvers.
Going forward, they will only be used to check symbol references, but
not dumped into the current $(KBUILD_EXTMOD)/Module.symvers. I believe
this makes more sense.
Some vendors, like HPE or Dell, encode the release version of their BIOS
in the "System BIOS {Major|Minor} Release" fields of Type 0.
This information is used to know which BIOS release is actually running.
It could be used for some quirks, debugging sessions or inventory tasks.
A typical output for a Dell system running the 65.27 BIOS is:
[root@t1700 ~]# cat /sys/devices/virtual/dmi/id/bios_release
65.27
[root@t1700 ~]#
Servers that have a BMC encode the release version of their firmware in the
"Embedded Controller Firmware {Major|Minor} Release" fields of Type 0.
This information is used to know which BMC release is actually running.
It could be used for some quirks, debugging sessions or inventory tasks.
A typical output for a Dell system running the 3.75 BMC release is:
[root@t1700 ~]# cat /sys/devices/virtual/dmi/id/ec_firmware_release
3.75
[root@t1700 ~]#
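Both attributes use the same "major.minor" text format, so a userspace inventory tool could read and split them as sketched below. This is a hypothetical C helper, not part of the kernel or of any existing tool; it assumes the attribute holds a single line such as "65.27" and treats a missing or unreadable file as "not available".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical parser for a "major.minor" release string such as "65.27".
 * A single trailing newline (as read from sysfs) is accepted. */
static bool parse_release(const char *s, unsigned *major, unsigned *minor)
{
    char tail;
    int n = sscanf(s, "%u.%u%c", major, minor, &tail);

    return n == 2 || (n == 3 && tail == '\n');
}

/* Hypothetical reader, e.g. for "/sys/devices/virtual/dmi/id/bios_release"
 * or ".../ec_firmware_release". Returns false if the file is absent. */
static bool read_release(const char *path, unsigned *major, unsigned *minor)
{
    char line[32];
    FILE *f = fopen(path, "r");
    bool ok = false;

    if (!f)
        return false;
    if (fgets(line, sizeof(line), f))
        ok = parse_release(line, major, minor);
    fclose(f);
    return ok;
}
```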
Signed-off-by: Erwan Velu <e.velu@criteo.com>
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Linus Torvalds [Fri, 5 Jun 2020 23:44:36 +0000 (16:44 -0700)]
Merge tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux
Pull orangefs updates from Mike Marshall:
- John Hubbard's conversion from get_user_pages() to pin_user_pages()
- Colin Ian King's removal of an unneeded variable initialization
* tag 'for-linus-5.8-ofs1' of git://git.kernel.org/pub/scm/linux/kernel/git/hubcap/linux:
orangefs: convert get_user_pages() --> pin_user_pages()
orangefs: remove redundant assignment to variable ret
Linus Torvalds [Fri, 5 Jun 2020 23:43:16 +0000 (16:43 -0700)]
Merge tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm
Pull dlm updates from David Teigland:
"This set includes a couple minor cleanups, and dropping the
interruptible from a wait_event that waits for an event from the
userspace cluster management"
* tag 'dlm-5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
dlm: remove BUG() before panic()
dlm: Switch to using wait_event()
fs:dlm:remove unneeded semicolon in rcom.c
dlm: user: Replace zero-length array with flexible-array member
dlm: dlm_internal: Replace zero-length array with flexible-array member
Linus Torvalds [Fri, 5 Jun 2020 23:40:53 +0000 (16:40 -0700)]
Merge tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs updates from Steve French:
"22 changesets, 2 for stable.
Includes big performance improvement for large i/o when using
multichannel, also includes DFS fixes"
* tag '5.8-rc-smb3-fixes-part-1' of git://git.samba.org/sfrench/cifs-2.6: (22 commits)
cifs: update internal module version number
cifs: multichannel: try to rebind when reconnecting a channel
cifs: multichannel: use pointer for binding channel
smb3: remove static checker warning
cifs: multichannel: move channel selection above transport layer
cifs: multichannel: always zero struct cifs_io_parms
cifs: dump Security Type info in DebugData
smb3: fix incorrect number of credits when ioctl MaxOutputResponse > 64K
smb3: default to minimum of two channels when multichannel specified
cifs: multichannel: move channel selection in function
cifs: fix minor typos in comments and log messages
smb3: minor update to compression header definitions
cifs: minor fix to two debug messages
cifs: Standardize logging output
smb3: Add new parm "nodelete"
cifs: move some variables off the stack in smb2_ioctl_query_info
cifs: reduce stack use in smb2_compound_op
cifs: get rid of unused parameter in reconn_setup_dfs_targets()
cifs: handle hostnames that resolve to same ip in failover
cifs: set up next DFS target before generic_ip_connect()
...
Linus Torvalds [Fri, 5 Jun 2020 23:26:36 +0000 (16:26 -0700)]
Merge tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS updates from David Howells:
"There's some core VFS changes which affect a couple of filesystems:
- Make the inode hash table RCU safe and provide some RCU-safe
accessor functions. The search can then be done without taking the
inode_hash_lock. Care must be taken because the object may be being
deleted and no wait is made.
- Allow iunique() to avoid taking the inode_hash_lock.
- Allow AFS's callback processing to avoid taking the inode_hash_lock
when using the inode table to find an inode to notify.
- Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
I've plugged this issue with try-lock in ext4 lazy time update.
This solution is much better."
Then there's a set of changes to make a number of improvements to the
AFS driver:
- Improve callback (i.e. third-party change notification) processing
by:
(a) Relying more on the fact we're doing this under RCU and by
using fewer locks. This makes use of the RCU-based inode
searching outlined above.
(b) Moving to keeping volumes in a tree indexed by volume ID
rather than a flat list.
(c) Making the server and volume records logically part of the
cell. This means that a server record now points directly at
the cell and the tree of volumes is there. This removes an N:M
mapping table, simplifying things.
- Improve keeping NAT or firewall channels open for the server
callbacks to reach the client by actively polling the fileserver on
a timed basis, instead of only doing it when we have an operation
to process.
- Improving detection of delayed or lost callbacks by including the
parent directory in the list of file IDs to be queried when doing a
bulk status fetch from lookup. We can then check to see if our copy
of the directory has changed under us without us getting notified.
- Determine aliasing of cells (such as a cell that is pointed to by a
DNS alias). This allows us to avoid having ambiguity due to
apparently different cells using the same volume and file servers.
- Improve the fileserver rotation to do more probing when it detects
that all of the addresses to a server are listed as non-responsive.
It's possible that an address that previously stopped responding
has become responsive again.
Beyond that, lay some foundations for making some calls asynchronous:
- Turn the fileserver cursor struct into a general operation struct
and hang the parameters off of that rather than keeping them in
local variables and hang results off of that rather than the call
struct.
- Implement some general operation handling code and simplify the
callers of operations that affect a volume or a volume component
(such as a file). Most of the operation is now done by core code.
- Operations are supplied with a table of operations to issue
different variants of RPCs and to manage the completion, where all
the required data is held in the operation object, thereby allowing
these to be called from a workqueue.
- Put the standard "if (begin), while(select), call op, end" sequence
into a canned function that just emulates the current behaviour for
now.
There are also some fixes interspersed:
- Don't let the EACCES from ICMP6 mapping reach the user as such,
since it's confusing as to whether it's a filesystem error. Convert
it to EHOSTUNREACH.
- Don't use the epoch value acquired through probing a server. If we
have two servers with the same UUID but in different cells, it's
hard to draw conclusions from them having different epoch values.
- Don't interpret the argument to the CB.ProbeUuid RPC as a
fileserver UUID and look up a fileserver from it.
- Deal with servers in different cells having the same UUIDs. In the
event that a CB.InitCallBackState3 RPC is received, we have to
break the callback promises for every server record matching that
UUID.
- Don't let afs_statfs return values that go below 0.
- Don't use running fileserver probe state to make server selection
and address selection decisions on. Only make decisions on final
state as the running state is cleared at the start of probing"
Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part)
* tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
afs: Show more a bit more server state in /proc/net/afs/servers
afs: Don't use probe running state to make decisions outside probe code
afs: Fix afs_statfs() to not let the values go below zero
afs: Fix the by-UUID server tree to allow servers with the same UUID
afs: Reorganise volume and server trees to be rooted on the cell
afs: Add a tracepoint to track the lifetime of the afs_volume struct
afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
afs: Detect cell aliases 2 - Cells with no root volumes
afs: Detect cell aliases 1 - Cells with root volumes
afs: Implement client support for the YFSVL.GetCellName RPC op
afs: Retain more of the VLDB record for alias detection
afs: Fix handling of CB.ProbeUuid cache manager op
afs: Don't get epoch from a server because it may be ambiguous
afs: Build an abstraction around an "operation" concept
afs: Rename struct afs_fs_cursor to afs_operation
afs: Remove the error argument from afs_protocol_error()
afs: Set error flag rather than return error from file status decode
afs: Make callback processing more efficient.
afs: Show more information in /proc/net/afs/servers
...
Linus Torvalds [Fri, 5 Jun 2020 23:19:28 +0000 (16:19 -0700)]
Merge tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A lot of bug fixes and cleanups for ext4, including:
- Fix performance problems found in dioread_nolock now that it is the
default, caused by transaction leaks.
- Clean up fiemap handling in ext4
- Clean up and refactor multiple block allocator (mballoc) code
- Fix a problem with mballoc with smaller file systems running out
of blocks because they couldn't properly use blocks that had been
reserved by inode preallocation.
- Fixed a race in ext4_sync_parent() versus rename()
- Simplify the error handling in the extent manipulation code
- Make sure all metadata I/O errors are reflected to
ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.
- Avoid passing an error pointer to brelse in ext4_xattr_set()
- Fix a race which could result in freeing an inode on the dirty list
in data=journal mode.
- Fix refcount handling if ext4_iget() fails
- Fix a crash in generic/019 caused by a corrupted extent node"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
ext4: avoid unnecessary transaction starts during writeback
ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
fs: remove the access_ok() check in ioctl_fiemap
fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
fs: move fiemap range validation into the file systems instances
iomap: fix the iomap_fiemap prototype
fs: move the fiemap definitions out of fs.h
fs: mark __generic_block_fiemap static
ext4: remove the call to fiemap_check_flags in ext4_fiemap
ext4: split _ext4_fiemap
ext4: fix fiemap size checks for bitmap files
ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
add comment for ext4_dir_entry_2 file_type member
jbd2: avoid leaking transaction credits when unreserving handle
ext4: drop ext4_journal_free_reserved()
ext4: mballoc: use lock for checking free blocks while retrying
ext4: mballoc: refactor ext4_mb_good_group()
ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
ext4: mballoc: refactor ext4_mb_discard_preallocations()
...
Linus Torvalds [Fri, 5 Jun 2020 22:45:03 +0000 (15:45 -0700)]
Merge tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- The largest change for this cycle is the DM zoned target's metadata
version 2 feature that adds support for pairing regular block devices
with a zoned device to ease the performance impact associated with
the finite number of random zones on a zoned device.
The changes came in three batches: the first prepared for and then
added the ability to pair a single regular block device, the second
was a batch of fixes to improve zoned's reclaim heuristic, and the
third removed the limitation of only adding a single additional
regular block device to allow many devices.
Testing has shown linear scaling as more devices are added.
- Add a new emulated block size (ebs) target that emulates a smaller
logical_block_size than a block device supports.
The primary use-case is to emulate "512e" devices that have 512 byte
logical_block_size and 4KB physical_block_size. This is useful to
some legacy applications that otherwise wouldn't be able to be used
on 4K devices because they depend on issuing IO in 512 byte
granularity.
- Add discard interfaces to DM bufio. The first consumer of the interface
is the dm-ebs target, which makes heavy use of dm-bufio.
- Fix DM crypt's block queue_limits stacking to not truncate
logical_block_size.
- Add Documentation for DM integrity's status line.
- Switch DMDEBUG from a compile time config option to instead use
dynamic debug via pr_debug.
- Fix DM multipath target's heuristic for how it manages
"queue_if_no_path" state internally.
DM multipath now avoids disabling "queue_if_no_path" unless it is
actually needed (e.g. in response to a configured timeout or an explicit
"fail_if_no_path" message).
This fixes reports of spurious -EIO being returned to userspace
applications during fault tolerance testing with an NVMe backend.
Added various dynamic DMDEBUG messages to assist with debugging
queue_if_no_path in the future.
- Add a new DM multipath "Historical Service Time" Path Selector.
- Fix DM multipath's dm_blk_ioctl() to switch paths on IO error.
- Improve DM writecache target performance by using explicit cache
flushing for the target's single-threaded use case, and a small cleanup
to remove an unnecessary test in persistent_memory_claim.
- Other small cleanups in DM core, dm-persistent-data, and DM
integrity.
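For the dm-ebs target described above, the core arithmetic of "512e" emulation can be sketched as follows: map each 512-byte logical sector to the 4 KiB physical block that holds it, and decide when a write cannot be issued directly and instead needs a read-modify-write of a partially covered block. This is not dm-ebs code (which, per the text above, is built on dm-bufio); the function names and the fixed 512/4096 sizes are assumptions for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for a "512e" device: 512-byte logical, 4 KiB physical. */
#define LOGICAL_BLOCK_SIZE  512u
#define PHYSICAL_BLOCK_SIZE 4096u
#define SECTORS_PER_BLOCK   (PHYSICAL_BLOCK_SIZE / LOGICAL_BLOCK_SIZE)

/* Hypothetical helper: which 4 KiB block holds a given 512-byte sector. */
static uint64_t ebs_block_of(uint64_t sector)
{
    return sector / SECTORS_PER_BLOCK;
}

/* Hypothetical helper: byte offset of that sector inside its 4 KiB block. */
static uint32_t ebs_offset_in_block(uint64_t sector)
{
    return (uint32_t)(sector % SECTORS_PER_BLOCK) * LOGICAL_BLOCK_SIZE;
}

/*
 * A write of nr_sectors starting at sector only covers whole 4 KiB blocks
 * if it both starts and ends on a block boundary; otherwise the partially
 * covered block(s) must be read, patched in memory, and written back.
 */
static bool ebs_needs_rmw(uint64_t sector, uint64_t nr_sectors)
{
    return (sector % SECTORS_PER_BLOCK) != 0 ||
           ((sector + nr_sectors) % SECTORS_PER_BLOCK) != 0;
}
```

For example, logical sector 9 lives in physical block 1 at byte offset 512, so a single-sector write there needs a read-modify-write, while an 8-sector write starting at sector 8 covers block 1 exactly and does not.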
* tag 'for-5.8/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (62 commits)
dm crypt: avoid truncating the logical block size
dm mpath: add DM device name to Failing/Reinstating path log messages
dm mpath: enhance queue_if_no_path debugging
dm mpath: restrict queue_if_no_path state machine
dm mpath: simplify __must_push_back
dm zoned: check superblock location
dm zoned: prefer full zones for reclaim
dm zoned: select reclaim zone based on device index
dm zoned: allocate zone by device index
dm zoned: support arbitrary number of devices
dm zoned: move random and sequential zones into struct dmz_dev
dm zoned: per-device reclaim
dm zoned: add metadata pointer to struct dmz_dev
dm zoned: add device pointer to struct dm_zone
dm zoned: allocate temporary superblock for tertiary devices
dm zoned: convert to xarray
dm zoned: add a 'reserved' zone flag
dm zoned: improve logging messages for reclaim
dm zoned: avoid unnecessary device recalulation for secondary superblock
dm zoned: add debugging message for reading superblocks
...
Linus Torvalds [Fri, 5 Jun 2020 22:11:50 +0000 (15:11 -0700)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"This series consists of the usual driver updates (qla2xxx, ufs, zfcp,
target, scsi_debug, lpfc, qedi, qedf, hisi_sas, mpt3sas) plus a host
of other minor updates.
There are no major core changes in this series apart from a
refactoring in scsi_lib.c"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (207 commits)
scsi: ufs: ti-j721e-ufs: Fix unwinding of pm_runtime changes
scsi: cxgb3i: Fix some leaks in init_act_open()
scsi: ibmvscsi: Make some functions static
scsi: iscsi: Fix deadlock on recovery path during GFP_IO reclaim
scsi: ufs: Fix WriteBooster flush during runtime suspend
scsi: ufs: Fix index of attributes query for WriteBooster feature
scsi: ufs: Allow WriteBooster on UFS 2.2 devices
scsi: ufs: Remove unnecessary memset for dev_info
scsi: ufs-qcom: Fix scheduling while atomic issue
scsi: mpt3sas: Fix reply queue count in non RDPQ mode
scsi: lpfc: Fix lpfc_nodelist leak when processing unsolicited event
scsi: target: tcmu: Fix a use after free in tcmu_check_expired_queue_cmd()
scsi: vhost: Notify TCM about the maximum sg entries supported per command
scsi: qla2xxx: Remove return value from qla_nvme_ls()
scsi: qla2xxx: Remove an unused function
scsi: iscsi: Register sysfs for iscsi workqueue
scsi: scsi_debug: Parser tables and code interaction
scsi: core: Refactor scsi_mq_setup_tags function
scsi: core: Fix incorrect usage of shost_for_each_device
scsi: qla2xxx: Fix endianness annotations in source files
...
Linus Torvalds [Fri, 5 Jun 2020 21:05:57 +0000 (14:05 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma updates from Jason Gunthorpe:
"A more active cycle than most of the recent past, with a few large,
long discussed works this time.
The RNBD block driver has been posted for nearly two years now, and is
flowing through RDMA because it also introduces a new ULP.
The removal of FMR has been a recurring discussion theme for a long
time.
And the usual smattering of features and bug fixes.
Summary:
- Various small driver bug fixes in rxe, mlx5, hfi1, and efa
- Continuing driver cleanups in bnxt_re, hns
- Big cleanup of mlx5 QP creation flows
- More consistent use of src port and flow label when LAG is used and
a mlx5 implementation
- Additional set of cleanups for IB CM
- 'RNBD' network block driver and target. This is a network block
RDMA device specific to ionos's cloud environment. It brings strong
multipath and resiliency capabilities.
- Accelerated IPoIB for HFI1
- QP/WQ/SRQ ioctl migration for uverbs, and support for multiple
async fds
- Support for exchanging the new IBTA-defined ECE data during RDMA CM
exchanges
- Removal of the very old and insecure FMR interface from all ULPs
and drivers. FRWR has been the preferred interface for at least a decade"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits)
RDMA/cm: Spurious WARNING triggered in cm_destroy_id()
RDMA/mlx5: Return ECE DC support
RDMA/mlx5: Don't rely on FW to set zeros in ECE response
RDMA/mlx5: Return an error if copy_to_user fails
IB/hfi1: Use free_netdev() in hfi1_netdev_free()
RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr()
RDMA/core: Move and rename trace_cm_id_create()
IB/hfi1: Fix hfi1_netdev_rx_init() error handling
RDMA: Remove 'max_map_per_fmr'
RDMA: Remove 'max_fmr'
RDMA/core: Remove FMR device ops
RDMA/rdmavt: Remove FMR memory registration
RDMA/mthca: Remove FMR support for memory registration
RDMA/mlx4: Remove FMR support for memory registration
RDMA/i40iw: Remove FMR leftovers
RDMA/bnxt_re: Remove FMR leftovers
RDMA/mlx5: Remove FMR leftovers
RDMA/core: Remove FMR pool API
RDMA/rds: Remove FMR support for memory registration
RDMA/srp: Remove support for FMR memory registration
...
Linus Torvalds [Fri, 5 Jun 2020 21:00:30 +0000 (14:00 -0700)]
Merge tag 'gpio-v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO updates from Linus Walleij:
"This is the bulk of GPIO changes for the v5.8 kernel cycle.
Core changes:
- A new GPIO aggregator driver has been merged: this can join a few
select GPIO lines into a new aggregated GPIO chip. This can be used
for security: a process can be granted access to only these lines,
for example for industrial control. Another way to use this is to
reexpose certain select lines to a virtual machine or container.
- Warn if gpio-line-names is too long in the DT parser core.
- GPIO lines can now be looked up by line name in addition to being
looked up by offset.
New drivers:
- A new generic regmap GPIO driver has been merged. Too many regmap
drivers are starting to look like each other so we need to create
some common ground and try to move drivers over to using that.
- The F7188X driver now supports F81865.
Driver improvements:
- Large improvements to the PCA953x expander driver, including support
for getting multiple lines at once, and several cleanups.
- Large improvements to the DesignWare DWAPB driver, and Sergey Semin
has volunteered to maintain it.
- PL061 can now be built as a module; this is part of a bigger effort
to make the ARM platforms more modular"
* tag 'gpio-v5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio: (77 commits)
gpio: pca953x: Drop unneeded ACPI_PTR()
MAINTAINERS: Add gpio regmap section
gpio: add a reusable generic gpio_chip using regmap
gpiolib: Introduce gpiochip_irqchip_add_domain()
gpio: gpiolib: Allow GPIO IRQs to lazy disable
gpiolib: Separate GPIO_GET_LINEINFO_WATCH_IOCTL conditional
gpio: rcar: Fix runtime PM imbalance on error
gpio: pca935x: Allow IRQ support for driver built as a module
gpio: pxa: Add COMPILE_TEST support
dt-bindings: gpio: Add renesas,em-gio bindings
MAINTAINERS: Fix file name for DesignWare GPIO DT schema
gpio: dwapb: Remove unneeded has_irq member in struct dwapb_port_property
gpio: dwapb: Don't use IRQ 0 as valid Linux interrupt
gpio: dwapb: avoid error message for optional IRQ
gpio: dwapb: Call acpi_gpiochip_free_interrupts() on GPIO chip de-registration
gpio: max730x: bring gpiochip_add_data after port config
MAINTAINERS: Add GPIO Aggregator section
docs: gpio: Add GPIO Aggregator documentation
gpio: Add GPIO Aggregator
gpiolib: Add support for GPIO lookup by line name
...
Linus Torvalds [Fri, 5 Jun 2020 20:58:04 +0000 (13:58 -0700)]
Merge tag 'for-linus-5.8-1' of git://github.com/cminyard/linux-ipmi
Pull IPMI updates from Corey Minyard:
"A few small fixes for things, nothing earth-shattering"
* tag 'for-linus-5.8-1' of git://github.com/cminyard/linux-ipmi:
ipmi:ssif: Remove dynamic platform device handing
Try to load acpi_ipmi when an SSIF ACPI IPMI interface is added
ipmi_si: Load acpi_ipmi when ACPI IPMI interface added
ipmi:bt-bmc: Fix error handling and status check
ipmi: Replace guid_copy() with import_guid() where it makes sense
ipmi: use vzalloc instead of kmalloc for user creation
ipmi:bt-bmc: Fix some format issue of the code
ipmi:bt-bmc: Avoid unnecessary check
Linus Torvalds [Fri, 5 Jun 2020 20:51:49 +0000 (13:51 -0700)]
Merge tag 'vfio-v5.8-rc1' of git://github.com/awilliam/linux-vfio
Pull VFIO updates from Alex Williamson:
- Block accesses to disabled MMIO space (Alex Williamson)
- VFIO device migration API (Kirti Wankhede)
- type1 IOMMU dirty bitmap API and implementation (Kirti Wankhede)
- PCI NULL capability masking (Alex Williamson)
- Memory leak fixes (Qian Cai)
- Reference leak fix (Qiushi Wu)
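The type1 dirty bitmap API listed above tracks dirty state at page granularity: one bit per IOMMU page in an IOVA range. The sizing arithmetic can be sketched as below, using shifts rather than 64-bit division (compare the "Use shift operation for 64-bit integer division" commit further down). The helper names, and the choice to round the bitmap up to whole 64-bit words, are assumptions for illustration, not the vfio implementation or its UAPI.

```c
#include <stdint.h>

/* Hypothetical helper: pages of size (1 << pgshift) in an IOVA range. */
static uint64_t dirty_bitmap_npages(uint64_t size, unsigned int pgshift)
{
    /* A shift avoids a 64-bit division helper on 32-bit architectures. */
    return size >> pgshift;
}

/* Hypothetical helper: bytes for one bit per page, in whole 64-bit words. */
static uint64_t dirty_bitmap_bytes(uint64_t npages)
{
    return ((npages + 63) / 64) * 8;
}

/*
 * Hypothetical helper: mark the page containing iova as dirty.  Bit 0 of
 * bitmap[0] corresponds to the page at base, the start of the range.
 */
static void dirty_bitmap_set(uint64_t *bitmap, uint64_t base, uint64_t iova,
                             unsigned int pgshift)
{
    uint64_t bit = (iova - base) >> pgshift;

    bitmap[bit / 64] |= 1ULL << (bit % 64);
}
```

With 4 KiB pages (pgshift 12), a 1 MiB range spans 256 pages and needs a 32-byte bitmap; dirtying the 66th page of the range sets bit 1 of the second 64-bit word.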
* tag 'vfio-v5.8-rc1' of git://github.com/awilliam/linux-vfio:
vfio iommu: typecast corrections
vfio iommu: Use shift operation for 64-bit integer division
vfio/mdev: Fix reference count leak in add_mdev_supported_type
vfio: Selective dirty page tracking if IOMMU backed device pins pages
vfio iommu: Add migration capability to report supported features
vfio iommu: Update UNMAP_DMA ioctl to get dirty bitmap before unmap
vfio iommu: Implementation of ioctl for dirty pages tracking
vfio iommu: Add ioctl definition for dirty pages tracking
vfio iommu: Cache pgsize_bitmap in struct vfio_iommu
vfio iommu: Remove atomicity of ref_count of pinned pages
vfio: UAPI for migration interface for device state
vfio/pci: fix memory leaks of eventfd ctx
vfio/pci: fix memory leaks in alloc_perm_bits()
vfio-pci: Mask cap zero
vfio-pci: Invalidate mmaps and block MMIO access on disabled memory
vfio-pci: Fault mmaps to enable vma tracking
vfio/type1: Support faulting PFNMAP vmas
Linus Torvalds [Fri, 5 Jun 2020 20:45:21 +0000 (13:45 -0700)]
Merge tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull READ_IMPLIES_EXEC changes from Borislav Petkov:
"Split the old READ_IMPLIES_EXEC workaround from executable
PT_GNU_STACK now that toolchains have long supported PT_GNU_STACK
marking and there's no longer any need to force modern programs to
have all their user mappings executable instead of only the stack and
the PROT_EXEC ones.
Disable that automatic READ_IMPLIES_EXEC forcing on x86-64 and
arm64.
Add tables documenting how READ_IMPLIES_EXEC is handled on x86-64, arm
and arm64.
By Kees Cook"
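The policy change described above can be summarised as a small decision function. Previously, a missing or executable PT_GNU_STACK forced READ_IMPLIES_EXEC. After the split, only a missing marking does so, and on x86-64 and arm64 only for 32-bit processes. This is a simplified sketch, not the kernel's elf_read_implies_exec() macro; the enum and function names are hypothetical, and the per-architecture tables added by these commits are the authoritative reference.

```c
#include <stdbool.h>

/* Hypothetical encoding of an ELF binary's PT_GNU_STACK state. */
enum gnu_stack {
    GNU_STACK_MISSING, /* no PT_GNU_STACK header (legacy toolchain) */
    GNU_STACK_NOEXEC,  /* PT_GNU_STACK present without PF_X */
    GNU_STACK_EXEC,    /* PT_GNU_STACK present with PF_X */
};

/* Old behaviour (sketch): anything but a non-exec stack forced RIE. */
static bool old_read_implies_exec(enum gnu_stack s)
{
    return s != GNU_STACK_NOEXEC;
}

/*
 * New behaviour on x86-64/arm64 (sketch): an executable stack no longer
 * implies RIE, and 64-bit processes never get it automatically.
 */
static bool new_read_implies_exec(enum gnu_stack s, bool is_64bit)
{
    return !is_64bit && s == GNU_STACK_MISSING;
}
```

The practical effect is that a modern 64-bit binary that asks for an executable stack gets exactly that, rather than having every readable mapping also become executable.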
* tag 'core_core_updates_for_5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
arm64/elf: Disable automatic READ_IMPLIES_EXEC for 64-bit address spaces
arm32/64/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
arm32/64/elf: Add tables to document READ_IMPLIES_EXEC
x86/elf: Disable automatic READ_IMPLIES_EXEC on 64-bit
x86/elf: Split READ_IMPLIES_EXEC from executable PT_GNU_STACK
x86/elf: Add table to document READ_IMPLIES_EXEC
Linus Torvalds [Fri, 5 Jun 2020 19:39:30 +0000 (12:39 -0700)]
Merge tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Support for userspace to send requests directly to the on-chip GZIP
accelerator on Power9.
- Rework of our lockless page table walking (__find_linux_pte()) to
make it safe against parallel page table manipulations without
relying on an IPI for serialisation.
- A series of fixes & enhancements to make our machine check handling
more robust.
- Lots of plumbing to add support for "prefixed" (64-bit) instructions
on Power10.
- Support for using huge pages for the linear mapping on 8xx (32-bit).
- Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
driver.
- Removal of some obsolete 40x platforms and associated cruft.
- Initial support for booting on Power10.
- Lots of other small features, cleanups & fixes.
Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
Srinivasan, Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
Wolfram Sang, Xiongfeng Wang.
* tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
powerpc/pseries: Make vio and ibmebus initcalls pseries specific
cxl: Remove dead Kconfig options
powerpc: Add POWER10 architected mode
powerpc/dt_cpu_ftrs: Add MMA feature
powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
powerpc: Add support for ISA v3.1
powerpc: Add new HWCAP bits
powerpc/64s: Don't set FSCR bits in INIT_THREAD
powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
powerpc/64s: Don't let DT CPU features set FSCR_DSCR
powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
powerpc/module_64: Consolidate ftrace code
powerpc/32: Disable KASAN with pages bigger than 16k
powerpc/uaccess: Don't set KUEP by default on book3s/32
powerpc/uaccess: Don't set KUAP by default on book3s/32
powerpc/8xx: Reduce time spent in allow_user_access() and friends
...
Linus Torvalds [Fri, 5 Jun 2020 19:31:16 +0000 (12:31 -0700)]
Merge tag 'modules-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
- Harden CONFIG_STRICT_MODULE_RWX by rejecting any module that has
SHF_WRITE|SHF_EXECINSTR sections
- Remove and clean up nested #ifdefs, as it makes code hard to read
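The hardening described above amounts to rejecting any module that has a section that is simultaneously writable and executable. A userspace sketch of that check over ELF section headers, using the standard flag definitions from <elf.h>, could look like this. The function name is hypothetical; in the kernel the check runs on the module's section headers during loading.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if any section is both writable and executable.  A loader
 * enforcing W^X, as STRICT_MODULE_RWX does for modules, would reject it.
 */
static bool has_wx_section(const Elf64_Shdr *shdrs, size_t nsections)
{
    const Elf64_Xword wx = SHF_WRITE | SHF_EXECINSTR;

    for (size_t i = 0; i < nsections; i++) {
        /* Both bits must be set; either one alone is acceptable. */
        if ((shdrs[i].sh_flags & wx) == wx)
            return true;
    }
    return false;
}
```

A typical .text section (SHF_ALLOC | SHF_EXECINSTR) and .data section (SHF_ALLOC | SHF_WRITE) pass the check; only a section carrying both flags trips it.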
* tag 'modules-for-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: Harden STRICT_MODULE_RWX
module: break nested ARCH_HAS_STRICT_MODULE_RWX and STRICT_MODULE_RWX #ifdefs
Eric Biggers [Thu, 4 Jun 2020 19:01:26 +0000 (12:01 -0700)]
dm crypt: avoid truncating the logical block size
queue_limits::logical_block_size was changed from unsigned short to
unsigned int, but crypt_io_hints() was not updated to use the new
type. Fix it.
Fixes: 47f50cdf5154 ("block: fix an integer overflow in logical block size") Cc: stable@vger.kernel.org Signed-off-by: Eric Biggers <ebiggers@google.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
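The truncation can be illustrated with a userspace stand-in for the kernel's max_t(), which casts both operands to the named type before comparing. This sketch assumes the hint is computed as a max_t()-style maximum of the stacked logical block size and dm-crypt's sector size. If the comparison is done in unsigned short, a 64 KiB logical block size wraps to 0 and the "maximum" silently becomes the smaller value. MAX_T and both function names are hypothetical.

```c
/* Rough userspace stand-in for max_t(): cast both operands, then compare. */
#define MAX_T(type, a, b) ((type)(a) > (type)(b) ? (type)(a) : (type)(b))

/* Before the fix (sketch): the comparison is done in unsigned short. */
static unsigned int lbs_hint_old(unsigned int lbs, unsigned int sector_size)
{
    return MAX_T(unsigned short, lbs, sector_size);
}

/* After the fix (sketch): the comparison keeps the full unsigned int. */
static unsigned int lbs_hint_new(unsigned int lbs, unsigned int sector_size)
{
    return MAX_T(unsigned int, lbs, sector_size);
}
```

For ordinary sizes the two agree (max(512, 4096) is 4096 either way), but with a 65536-byte logical block size the old version returns 4096 while the new one correctly returns 65536.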
Mike Snitzer [Wed, 27 May 2020 20:32:51 +0000 (16:32 -0400)]
dm mpath: restrict queue_if_no_path state machine
Do not allow a disabled queue_if_no_path to be saved if it was already
saved as enabled; that would imply multiple suspends (which should never
happen). Log if this unlikely scenario is ever triggered.
Also, only write MPATHF_SAVED_QUEUE_IF_NO_PATH during presuspend or in
response to a "fail_if_no_path" message. Previously
MPATHF_SAVED_QUEUE_IF_NO_PATH was always modified, even when
queue_if_no_path()'s save_old_value argument wasn't set. This gives
tighter control over the management of MPATHF_SAVED_QUEUE_IF_NO_PATH.
A side effect is that multipath_resume() doesn't reset
MPATHF_QUEUE_IF_NO_PATH unless MPATHF_SAVED_QUEUE_IF_NO_PATH was set
(during presuspend), and at that time the MPATHF_SAVED_QUEUE_IF_NO_PATH
bit is cleared. So MPATHF_SAVED_QUEUE_IF_NO_PATH's use is much narrower
in scope.
Last, but not least, do _not_ disable queue_if_no_path during noflush
suspend. There is no need for, or benefit to, saving off queue_if_no_path
via MPATHF_SAVED_QUEUE_IF_NO_PATH and clearing MPATHF_QUEUE_IF_NO_PATH
for a noflush suspend. Avoiding this needless flag churn leaves less
potential for MPATHF_QUEUE_IF_NO_PATH to get lost, which in turn reduces
the potential for IOs to be errored back up to userspace during DM
multipath's handling of path failures.
That said, this last change papers over a reported issue concerning
request-based dm-multipath's interaction with blk-mq, relative to
suspend and resume: multipath_end_io is being called _before_
multipath_resume. This should never happen if DM suspend's
blk_mq_quiesce_queue() + dm_wait_for_completion() is genuinely waiting
for all inflight blk-mq requests to complete. Similarly,
drivers/md/dm.c:__dm_resume() clearly calls dm_table_resume_targets()
_before_ dm_start_queue()'s blk_mq_unquiesce_queue() is called. If
the queue isn't even restarted until after multipath_resume(), the BIG
question that still needs answering is: how can multipath_end_io beat
multipath_resume in a race!?
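The narrowed state machine described in this commit can be modelled with two flag bits: presuspend saves an enabled queue_if_no_path and clears it (except for a noflush suspend, which leaves it alone), and resume restores it only if a saved value exists, clearing the saved bit as it does so. This is a simplified model, not dm-mpath code. The bit names stand in for MPATHF_QUEUE_IF_NO_PATH and MPATHF_SAVED_QUEUE_IF_NO_PATH, the function names are hypothetical, and the logging and "fail_if_no_path" paths are omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the MPATHF_* bits. */
#define QUEUE_IF_NO_PATH       (1u << 0)
#define SAVED_QUEUE_IF_NO_PATH (1u << 1)

/* Model of presuspend: save an enabled setting, then stop queueing. */
static void sketch_presuspend(unsigned int *flags, bool noflush)
{
    if (noflush)
        return; /* noflush suspend: leave queue_if_no_path untouched */

    /* Only ever record "enabled"; never overwrite a saved enable. */
    if (*flags & QUEUE_IF_NO_PATH)
        *flags |= SAVED_QUEUE_IF_NO_PATH;
    *flags &= ~QUEUE_IF_NO_PATH;
}

/* Model of resume: restore only if something was saved, then forget it. */
static void sketch_resume(unsigned int *flags)
{
    if (*flags & SAVED_QUEUE_IF_NO_PATH) {
        *flags |= QUEUE_IF_NO_PATH;
        *flags &= ~SAVED_QUEUE_IF_NO_PATH;
    }
}
```

In this model a flush suspend/resume cycle round-trips an enabled queue_if_no_path, while a noflush suspend never touches either bit, which is the property the commit relies on to keep MPATHF_QUEUE_IF_NO_PATH from being lost.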
Mike Snitzer [Tue, 26 May 2020 20:06:56 +0000 (16:06 -0400)]
dm mpath: simplify __must_push_back
Remove a micro-optimization that infers the device is between presuspend
and resume (it was done purely to avoid a call to dm_noflush_suspending,
which isn't expensive anyway).
Remove the flags argument since it is no longer checked.
And remove must_push_back_bio() since it was simply a call to
__must_push_back().
Hannes Reinecke [Tue, 2 Jun 2020 11:09:53 +0000 (13:09 +0200)]
dm zoned: allocate zone by device index
When allocating a zone, pass in the index of the device on which the
zone should be allocated; this increases performance for a multi-device
setup because reclaim will now allocate zones on the device for which
reclaim is running.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
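The idea of allocating by device index can be sketched as a search restricted to zones on the requested device, so that reclaim running for device N only picks zones that live on N. This is a hypothetical, heavily simplified model: the struct, its fields and the function name are assumptions, not dm-zoned's data structures, and this sketch does not fall back to other devices when none is free.

```c
#include <stddef.h>

/* Hypothetical, heavily simplified zone descriptor. */
struct zone {
    unsigned int dev_idx; /* index of the device in the set holding the zone */
    int is_free;          /* nonzero if the zone is currently unallocated */
};

/*
 * Return the first free zone on device dev_idx and mark it allocated, or
 * NULL if that device has no free zone.  Reclaim for device N would call
 * this with N so that data it moves stays on the same device.
 */
static struct zone *alloc_zone_on_dev(struct zone *zones, size_t nr,
                                      unsigned int dev_idx)
{
    for (size_t i = 0; i < nr; i++) {
        if (zones[i].is_free && zones[i].dev_idx == dev_idx) {
            zones[i].is_free = 0;
            return &zones[i];
        }
    }
    return NULL;
}
```

Restricting the search this way also pairs naturally with the per-device reclaim workqueues introduced in the next commit, since each worker only contends for its own device's zones.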
Hannes Reinecke [Tue, 2 Jun 2020 11:09:50 +0000 (13:09 +0200)]
dm zoned: per-device reclaim
Instead of having one reclaim workqueue for the entire set, we should
be allocating a reclaim workqueue per device; doing so will reduce
contention and should boost performance for a multi-device setup.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Hannes Reinecke [Tue, 2 Jun 2020 11:09:47 +0000 (13:09 +0200)]
dm zoned: allocate temporary superblock for tertiary devices
Checking the tertiary superblock just consists of validating UUIDs,
CRCs, and the generation number; it doesn't have contents which would
be required during the actual operation.
So allocate a temporary superblock when checking tertiary devices to
avoid having to store it together with the 'real' superblocks.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Hannes Reinecke [Tue, 2 Jun 2020 11:09:46 +0000 (13:09 +0200)]
dm zoned: convert to xarray
The zones array is getting really large, and large arrays tend to
wreak havoc with the CPU caches. So convert it to xarray to become
more cache friendly.
Signed-off-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com> Signed-off-by: Colin Ian King <colin.king@canonical.com> # fix leak in dmz_insert Signed-off-by: Mike Snitzer <snitzer@redhat.com>