Andrii Nakryiko [Sun, 2 Aug 2020 01:32:17 +0000 (18:32 -0700)]
libbpf: Add btf__parse_raw() and generic btf__parse() APIs
Add public APIs to parse BTF from raw data file (e.g.,
/sys/kernel/btf/vmlinux), as well as generic btf__parse(), which will try to
determine correct format, currently either raw or ELF.
tools, build: Propagate build failures from tools/build/Makefile.build
The '&&' command seems to have a bad effect when $(cmd_$(1)) exits with
non-zero effect: the command failure is masked (despite `set -e`) and all but
the first command of $(dep-cmd) is executed (successfully, as they are mostly
printfs), thus overall returning 0 in the end.
This means in practice that despite compilation errors, tools's build Makefile
will return success. We see this very reliably with libbpf's Makefile, which
doesn't get compilation error propagated properly. This in turns causes issues
with selftests build, as well as bpftool and other projects that rely on
building libbpf.
The fix is simple: don't use &&. Given `set -e`, we don't need to chain
commands with &&. The shell will exit on first failure, giving desired
behavior and propagating error properly.
Fixes: 275e2d95591e ("tools build: Move dependency copy into function") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jiri Olsa <jolsa@redhat.com> Link: https://lore.kernel.org/bpf/20200731024244.872574-1-andriin@fb.com
selftests/bpf: Fix spurious test failures in core_retro selftest
core_retro selftest uses BPF program that's triggered on sys_enter
system-wide, but has no protection from some unrelated process doing syscall
while selftest is running. This leads to occasional test failures with
unexpected PIDs being returned. Fix that by filtering out all processes that
are not test_progs process.
Fixes: fcda189a5133 ("selftests/bpf: Add test relying only on CO-RE and no recent kernel features") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200731204957.2047119-1-andriin@fb.com
====================
This patch set adds new BPF link operation, LINK_DETACH, allowing processes
with BPF link FD to force-detach it from respective BPF hook, similarly how
BPF link is auto-detached when such BPF hook (e.g., cgroup, net_device, netns,
etc) is removed. This facility allows admin to forcefully undo BPF link
attachment, while process that created BPF link in the first place is left
intact.
Once force-detached, BPF link stays valid in the kernel as long as there is at
least one FD open against it. It goes into defunct state, just like
auto-detached BPF link.
bpftool also got `link detach` command to allow triggering this in
non-programmatic fashion.
v1->v2:
- improve error reporting in `bpftool link detach` (Song).
====================
Add LINK_DETACH command to force-detach bpf_link without destroying it. It has
the same behavior as auto-detaching of bpf_link due to cgroup dying for
bpf_cgroup_link or net_device being destroyed for bpf_xdp_link. In such case,
bpf_link is still a valid kernel object, but is defuncts and doesn't hold BPF
program attached to corresponding BPF hook. This functionality allows users
with enough access rights to manually force-detach attached bpf_link without
killing respective owner process.
This patch implements LINK_DETACH for cgroup, xdp, and netns links, mostly
re-using existing link release handling code.
Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20200731182830.286260-2-andriin@fb.com
John Fastabend [Fri, 31 Jul 2020 22:09:14 +0000 (15:09 -0700)]
bpf, selftests: Use single cgroup helpers for both test_sockmap/progs
Nearly every user of cgroup helpers does the same sequence of API calls. So
push these into a single helper cgroup_setup_and_join. The cases that do
a bit of extra logic are test_progs which currently uses an env variable
to decide if it needs to setup the cgroup environment or can use an
existingi environment. And then tests that are doing cgroup tests
themselves. We skip these cases for now.
Jakub Sitnicki [Sun, 26 Jul 2020 12:02:28 +0000 (14:02 +0200)]
udp, bpf: Ignore connections in reuseport group after BPF sk lookup
When BPF sk lookup invokes reuseport handling for the selected socket, it
should ignore the fact that reuseport group can contain connected UDP
sockets. With BPF sk lookup this is not relevant as we are not scoring
sockets to find the best match, which might be a connected UDP socket.
Fix it by unconditionally accepting the socket selected by reuseport.
This fixes the following two failures reported by test_progs.
# ./test_progs -t sk_lookup
...
#73/14 UDP IPv4 redir and reuseport with conns:FAIL
...
#73/20 UDP IPv6 redir and reuseport with conns:FAIL
...
Fixes: a57066b1a019 ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net") Reported-by: Alexei Starovoitov <alexei.starovoitov@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200726120228.1414348-1-jakub@cloudflare.com
libbpf: Make destructors more robust by handling ERR_PTR(err) cases
Most of libbpf "constructors" on failure return ERR_PTR(err) result encoded as
a pointer. It's a common mistake to eventually pass such malformed pointers
into xxx__destroy()/xxx__free() "destructors". So instead of fixing up
clean up code in selftests and user programs, handle such error pointers in
destructors themselves. This works beautifully for NULL pointers passed to
destructors, so might as well just work for error pointers.
Suggested-by: Song Liu <songliubraving@fb.com> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200729232148.896125-1-andriin@fb.com
Jakub Sitnicki [Thu, 30 Jul 2020 12:53:25 +0000 (14:53 +0200)]
selftests/bpf: Omit nodad flag when adding addresses to loopback
Setting IFA_F_NODAD flag for IPv6 addresses to add to loopback is
unnecessary. Duplicate Address Detection does not happen on loopback
device.
Also, passing 'nodad' flag to 'ip address' breaks libbpf CI, which runs in
an environment with BusyBox implementation of 'ip' command, that doesn't
understand this flag.
Fixes: 0ab5539f8584 ("selftests/bpf: Tests for BPF_SK_LOOKUP attach point") Reported-by: Andrii Nakryiko <andrii.nakryiko@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: Andrii Nakryiko <andrii@fb.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200730125325.1869363-1-jakub@cloudflare.com
Check that link is NULL or proper pointer before invoking bpf_link__destroy().
Not doing this causes crash in test_progs, when cg_storage_multi selftest
fails.
Fixes: 3573f384014f ("selftests/bpf: Test CGROUP_STORAGE behavior on shared egress + ingress") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200729045056.3363921-1-andriin@fb.com
Using existing 'bpf_sk_storage_get_proto' doesn't work because
second argument is ARG_PTR_TO_SOCKET. Even though
BPF_PROG_TYPE_CGROUP_SOCK hooks operate on 'struct bpf_sock',
the verifier still considers it as a PTR_TO_CTX.
That's why I'm adding another 'bpf_sk_storage_get_cg_sock_proto'
definition strictly for BPF_PROG_TYPE_CGROUP_SOCK which accepts
ARG_PTR_TO_CTX which is really 'struct sock' for this program type.
Yonghong Song [Tue, 28 Jul 2020 22:18:01 +0000 (15:18 -0700)]
selftests/bpf: Test bpf_iter buffer access with negative offset
Commit afbf21dce668 ("bpf: Support readonly/readwrite buffers
in verifier") added readonly/readwrite buffer support which
is currently used by bpf_iter tracing programs. It has
a bug with incorrect parameter ordering which later fixed
by Commit f6dfbe31e8fa ("bpf: Fix swapped arguments in calls
to check_buffer_access").
This patch added a test case with a negative offset access
which will trigger the error path.
Without Commit f6dfbe31e8fa, running the test case in the patch,
the error message looks like:
R1_w=rdwr_buf(id=0,off=0,imm=0) R10=fp0
; value_sum += *(__u32 *)(value - 4);
2: (61) r1 = *(u32 *)(r1 -4)
R1 invalid (null) buffer access: off=-4, size=4
Yonghong Song [Tue, 28 Jul 2020 22:18:01 +0000 (15:18 -0700)]
bpf: Add missing newline characters in verifier error messages
Newline characters are added in two verifier error messages,
refactored in Commit afbf21dce668 ("bpf: Support readonly/readwrite
buffers in verifier"). This way, they do not mix with
messages afterwards.
When a tracing BPF program attempts to read memory without using the
bpf_probe_read() helper, the verifier marks the load instruction with
the BPF_PROBE_MEM flag. Since the arm64 JIT does not currently recognize
this flag it falls back to the interpreter.
Add support for BPF_PROBE_MEM, by appending an exception table to the
BPF program. If the load instruction causes a data abort, the fixup
infrastructure finds the exception table and fixes up the fault, by
clearing the destination register and jumping over the faulting
instruction.
To keep the compact exception table entry format, inspect the pc in
fixup_exception(). A more generic solution would add a "handler" field
to the table entry, like on x86 and s390.
bpf: Fix build without CONFIG_NET when using BPF XDP link
Entire net/core subsystem is not built without CONFIG_NET. linux/netdevice.h
just assumes that it's always there, so the easiest way to fix this is to
conditionally compile out bpf_xdp_link_attach() use in bpf/syscall.c.
Fixes: aa8d3a716b59 ("bpf, xdp: Add bpf_link-based XDP attachment API") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200728190527.110830-1-andriin@fb.com
John Fastabend [Tue, 28 Jul 2020 14:39:02 +0000 (07:39 -0700)]
bpf, selftests: use :: 1 for localhost in tcp_server.py
Using localhost requires the host to have a /etc/hosts file with that
specific line in it. By default my dev box did not, they used
ip6-localhost, so the test was failing. To fix remove the need for any
/etc/hosts and use ::1.
Peilin Ye [Tue, 28 Jul 2020 05:36:04 +0000 (01:36 -0400)]
xdp: Prevent kernel-infoleak in xsk_getsockopt()
xsk_getsockopt() is copying uninitialized stack memory to userspace when
'extra_stats' is 'false'. Fix it. Doing '= {};' is sufficient since currently
'struct xdp_statistics' is defined as follows:
When being copied to the userspace, 'stats' will not contain any uninitialized
'holes' between struct fields.
Fixes: 8aa5a33578e9 ("xsk: Add new statistics") Suggested-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Song Liu <songliubraving@fb.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/bpf/20200728053604.404631-1-yepeilin.cs@gmail.com
Colin Ian King [Mon, 27 Jul 2020 17:54:11 +0000 (18:54 +0100)]
bpf: Fix swapped arguments in calls to check_buffer_access
There are a couple of arguments of the boolean flag zero_size_allowed and
the char pointer buf_info when calling to function check_buffer_access that
are swapped by mistake. Fix these by swapping them to correct the argument
ordering.
Fixes: afbf21dce668 ("bpf: Support readonly/readwrite buffers in verifier")
Addresses-Coverity: ("Array compared to 0") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200727175411.155179-1-colin.king@canonical.com
selftests/bpf: Add new bpf_iter context structs to fix build on old kernels
Add bpf_iter__bpf_map_elem and bpf_iter__bpf_sk_storage_map to bpf_iter.h.
Fixes: 3b1c420bd882 ("selftests/bpf: Add a test for bpf sk_storage_map iterator") Fixes: 2a7c2fff7dd6 ("selftests/bpf: Add test for bpf hash map iterators") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20200727233345.1686358-1-andriin@fb.com
bpf: Fix bpf_ringbuf_output() signature to return long
Due to bpf tree fix merge, bpf_ringbuf_output() signature ended up with int as
a return type, while all other helpers got converted to returning long. So fix
it in bpf-next now.
Fixes: b0659d8a950d ("bpf: Fix definition of bpf_ringbuf_output() helper in UAPI comments") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Song Liu <songliubraving@fb.com> Link: https://lore.kernel.org/bpf/20200727224715.652037-1-andriin@fb.com
tools, bpftool: Add LSM type to array of prog names
Assign "lsm" as a printed name for BPF_PROG_TYPE_LSM in bpftool, so that
it can use it when listing programs loaded on the system or when probing
features.
tools, bpftool: Skip type probe if name is not found
For probing program and map types, bpftool loops on type values and uses
the relevant type name in prog_type_name[] or map_type_name[]. To ensure
the name exists, we exit from the loop if we go over the size of the
array.
However, this is not enough in the case where the arrays have "holes" in
them, program or map types for which they have no name, but not at the
end of the list. This is currently the case for BPF_PROG_TYPE_LSM, not
known to bpftool and which name is a null string. When probing for
features, bpftool attempts to strlen() that name and segfaults.
Let's fix it by skipping probes for "unknown" program and map types,
with an informational message giving the numeral value in that case.
Fixes: 93a3545d812a ("tools/bpftool: Add name mappings for SK_LOOKUP prog and attach type") Reported-by: Paul Chaignon <paul@cilium.io> Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20200724090618.16378-2-quentin@isovalent.com
====================
Following cgroup and netns examples, implement bpf_link support for XDP.
The semantics is described in patch #2. Program and link attachments are
mutually exclusive, in the sense that neither link can replace attached
program nor program can replace attached link. Link can't replace attached
link as well, as is the case for any other bpf_link implementation.
Patch #1 refactors existing BPF program-based attachment API and centralizes
high-level query/attach decisions in generic kernel code, while drivers are
kept simple and are instructed with low-level decisions about attaching and
detaching specific bpf_prog. This also makes QUERY command unnecessary, and
patch #8 removes support for it from all kernel drivers. If that's a bad idea,
we can drop that patch altogether.
With refactoring in patch #1, adding bpf_xdp_link is completely transparent to
drivers, they are still functioning at the level of "effective" bpf_prog, that
should be called in XDP data path.
Corresponding libbpf support for BPF XDP link is added in patch #5.
v3->v4:
- fix a compilation warning in one of drivers (Jakub);
v2->v3:
- fix build when CONFIG_BPF_SYSCALL=n (kernel test robot);
v1->v2:
- fix prog refcounting bug (David);
- split dev_change_xdp_fd() changes into 2 patches (David);
- add extack messages to all user-induced errors (David).
====================
bpf, xdp: Remove XDP_QUERY_PROG and XDP_QUERY_PROG_HW XDP commands
Now that BPF program/link management is centralized in generic net_device
code, kernel code never queries program id from drivers, so
XDP_QUERY_PROG/XDP_QUERY_PROG_HW commands are unnecessary.
This patch removes all the implementations of those commands in kernel, along
the xdp_attachment_query().
Sync UAPI header and add support for using bpf_link-based XDP attachment.
Make xdp/ prog type set expected attach type. Kernel didn't enforce
attach_type for XDP programs before, so there is no backwards compatiblity
issues there.
Also fix section_names selftest to recognize that xdp prog types now have
expected attach type.
Add bpf_link-based API (bpf_xdp_link) to attach BPF XDP program through
BPF_LINK_CREATE command.
bpf_xdp_link is mutually exclusive with direct BPF program attachment,
previous BPF program should be detached prior to attempting to create a new
bpf_xdp_link attachment (for a given XDP mode). Once BPF link is attached, it
can't be replaced by other BPF program attachment or link attachment. It will
be detached only when the last BPF link FD is closed.
bpf_xdp_link will be auto-detached when net_device is shutdown, similarly to
how other BPF links behave (cgroup, flow_dissector). At that point bpf_link
will become defunct, but won't be destroyed until last FD is closed.
bpf, xdp: Extract common XDP program attachment logic
Further refactor XDP attachment code. dev_change_xdp_fd() is split into two
parts: getting bpf_progs from FDs and attachment logic, working with
bpf_progs. This makes attachment logic a bit more straightforward and
prepares code for bpf_xdp_link inclusion, which will share the common logic.
bpf, xdp: Maintain info on attached XDP BPF programs in net_device
Instead of delegating to drivers, maintain information about which BPF
programs are attached in which XDP modes (generic/skb, driver, or hardware)
locally in net_device. This effectively obsoletes XDP_QUERY_PROG command.
Such re-organization simplifies existing code already. But it also allows to
further add bpf_link-based XDP attachments without drivers having to know
about any of this at all, which seems like a good setup.
XDP_SETUP_PROG/XDP_SETUP_PROG_HW are just low-level commands to driver to
install/uninstall active BPF program. All the higher-level concerns about
prog/link interaction will be contained within generic driver-agnostic logic.
All the XDP_QUERY_PROG calls to driver in dev_xdp_uninstall() were removed.
It's not clear for me why dev_xdp_uninstall() were passing previous prog_flags
when resetting installed programs. That seems unnecessary, plus most drivers
don't populate prog_flags anyways. Having XDP_SETUP_PROG vs XDP_SETUP_PROG_HW
should be enough of an indicator of what is required of driver to correctly
reset active BPF program. dev_xdp_uninstall() is also generalized as an
iteration over all three supported mode.
bpf: Make bpf_link API available indepently of CONFIG_BPF_SYSCALL
Similarly to bpf_prog, make bpf_link and related generic API available
unconditionally to make it easier to have bpf_link support in various parts of
the kernel. Stub out init/prime/settle/cleanup and inc/put APIs.
Song Liu [Fri, 24 Jul 2020 20:05:02 +0000 (13:05 -0700)]
bpf: Fix build on architectures with special bpf_user_pt_regs_t
Architectures like s390, powerpc, arm64, riscv have speical definition of
bpf_user_pt_regs_t. So we need to cast the pointer before passing it to
bpf_get_stack(). This is similar to bpf_get_stack_tp().
Fixes: 03d42fd2d83f ("bpf: Separate bpf_get_[stack|stackid] for perf events BPF") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200724200503.3629591-1-songliubraving@fb.com
bpf/local_storage: Fix build without CONFIG_CGROUP
local_storage.o has its compile guard as CONFIG_BPF_SYSCALL, which
does not imply that CONFIG_CGROUP is on. Including cgroup-internal.h
when CONFIG_CGROUP is off cause a compilation failure.
Fixes: f67cfc233706 ("bpf: Make cgroup storages shared between programs on the same cgroup") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: YiFei Zhu <zhuyifei@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20200724211753.902969-1-zhuyifei1999@gmail.com
====================
To access the storage in a CGROUP_STORAGE map, one uses
bpf_get_local_storage helper, which is extremely fast due to its
use of per-CPU variables. However, its whole code is built on
the assumption that one map can only be used by one program at any
time, and this prohibits any sharing of data between multiple
programs using these maps, eliminating a lot of use cases, such
as some per-cgroup configuration storage, written to by a
setsockopt program and read by a cg_sock_addr program.
Why not use other map types? The great part of CGROUP_STORAGE map
is that it is isolated by different cgroups its attached to. When
one program uses bpf_get_local_storage, even on the same map, it
gets different storages if it were run as a result of attaching
to different cgroups. The kernel manages the storages, simplifying
BPF program or userspace. In theory, one could probably use other
maps like array or hash to do the same thing, but it would be a
major overhead / complexity. Userspace needs to know when a cgroup
is being freed in order to free up a space in the replacement map.
This patch set introduces a significant change to the semantics of
CGROUP_STORAGE map type. Instead of each storage being tied to one
single attachment, it is shared across different attachments to
the same cgroup, and persists until either the map or the cgroup
attached to is being freed.
User may use u64 as the key to the map, and the result would be
that the attach type become ignored during key comparison, and
programs of different attach types will share the same storage if
the cgroups they are attached to are the same.
How could this break existing users?
* Users that uses detach & reattach / program replacement as a
shortcut to zeroing the storage. Since we need sharing between
programs, we cannot zero the storage. Users that expect this
behavior should either attach a program with a new map, or
explicitly zero the map with a syscall.
This case is dependent on undocumented implementation details,
so the impact should be very minimal.
Patch 1 introduces a test on the old expected behavior of the map
type.
Patch 2 introduces a test showing how two programs cannot share
one such map.
Patch 3 implements the change of semantics to the map.
Patch 4 amends the new test such that it yields the behavior we
expect from the change.
Patch 5 documents the map type.
Changes since RFC:
* Clarify commit message in patch 3 such that it says the lifetime
of the storage is ended at the freeing of the cgroup_bpf, rather
than the cgroup itself.
* Restored an -ENOMEM check in __cgroup_bpf_attach.
* Update selftests for recent change in network_helpers API.
Changes since v1:
* s/CHECK_FAIL/CHECK/
* s/bpf_prog_attach/bpf_program__attach_cgroup/
* Moved test__start_subtest to test_cg_storage_multi.
* Removed some redundant CHECK_FAIL where they are already CHECK-ed.
Changes since v2:
* Lock cgroup_mutex during map_free.
* Publish new storages only if attach is successful, by tracking
exactly which storages are reused in an array of bools.
* Mention bpftool map dump showing a value of zero for attach_type
in patch 3 commit message.
Changes since v3:
* Use a much simpler lookup and allocate-if-not-exist from the fact
that cgroup_mutex is locked during attach.
* Removed an unnecessary spinlock hold.
Changes since v4:
* Changed semantics so that if the key type is struct
bpf_cgroup_storage_key the map retains isolation between different
attach types. Sharing between different attach types only occur
when key type is u64.
* Adapted tests and docs for the above change.
Changes since v5:
* Removed redundant NULL check before bpf_link__destroy.
* Free BPF object explicitly, after asserting that object failed to
load, in the event that the object did not fail to load.
* Rename variable in bpf_cgroup_storage_key_cmp for clarity.
* Added a lot of information to Documentation, more or less copied
from what Martin KaFai Lau wrote.
====================
Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Documentation/bpf: Document CGROUP_STORAGE map type
The machanics and usage are not very straightforward. Given the
changes it's better to document how it works and how to use it,
rather than having to rely on the examples and implementation to
infer what is going on.
selftests/bpf: Test CGROUP_STORAGE behavior on shared egress + ingress
This mirrors the original egress-only test. The cgroup_storage is
now extended to have two packet counters, one for egress and one
for ingress. We also extend to have two egress programs to test
that egress will always share with other egress origrams in the
same cgroup. The behavior of the counters are exactly the same as
the original egress-only test.
The test is split into two, one "isolated" test that when the key
type is struct bpf_cgroup_storage_key, which contains the attach
type, programs of different attach types will see different
storages. The other, "shared" test that when the key type is u64,
programs of different attach types will see the same storage if
they are attached to the same cgroup.
====================
Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, perf subsystem fetches callchain early,
and marks perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.
Similar issue exists when BPF program calls get_perf_callchain() via
helper functions. For more information about this issue, please refer to
discussions in [1].
This set fixes this issue with helper proto bpf_get_stackid_pe and
bpf_get_stack_pe.
Changes v4 => v5:
1. Return -EPROTO instead of -EINVAL on PERF_EVENT_IOC_SET_BPF errors.
(Alexei)
2. Let libbpf print a hint message when PERF_EVENT_IOC_SET_BPF returns
-EPROTO. (Alexei)
Changes v3 => v4:
1. Fix error check logic in bpf_get_stackid_pe and bpf_get_stack_pe.
(Alexei)
2. Do not allow attaching BPF programs with bpf_get_stack|stackid to
perf_event with precise_ip > 0, but not proper callchain. (Alexei)
3. Add selftest get_stackid_cannot_attach.
Changes v2 => v3:
1. Fix handling of stackmap skip field. (Andrii)
2. Simplify the code in a few places. (Andrii)
Changes v1 => v2:
1. Simplify the design and avoid introducing new helper function. (Andrii)
====================
bpf: Make cgroup storages shared between programs on the same cgroup
This change comes in several parts:
One, the restriction that the CGROUP_STORAGE map can only be used
by one program is removed. This results in the removal of the field
'aux' in struct bpf_cgroup_storage_map, and removal of relevant
code associated with the field, and removal of now-noop functions
bpf_free_cgroup_storage and bpf_cgroup_storage_release.
Second, we permit a key of type u64 as the key to the map.
Providing such a key type indicates that the map should ignore
attach type when comparing map keys. However, for simplicity newly
linked storage will still have the attach type at link time in
its key struct. cgroup_storage_check_btf is adapted to accept
u64 as the type of the key.
Third, because the storages are now shared, the storages cannot
be unconditionally freed on program detach. There could be two
ways to solve this issue:
* A. Reference count the usage of the storages, and free when the
last program is detached.
* B. Free only when the storage is impossible to be referred to
again, i.e. when either the cgroup_bpf it is attached to, or
the map itself, is freed.
Option A has the side effect that, when the user detach and
reattach a program, whether the program gets a fresh storage
depends on whether there is another program attached using that
storage. This could trigger races if the user is multi-threaded,
and since nondeterminism in data races is evil, go with option B.
The both the map and the cgroup_bpf now tracks their associated
storages, and the storage unlink and free are removed from
cgroup_bpf_detach and added to cgroup_bpf_release and
cgroup_storage_map_free. The latter also new holds the cgroup_mutex
to prevent any races with the former.
Fourth, on attach, we reuse the old storage if the key already
exists in the map, via cgroup_storage_lookup. If the storage
does not exist yet, we create a new one, and publish it at the
last step in the attach process. This does not create a race
condition because for the whole attach the cgroup_mutex is held.
We keep track of an array of new storages that was allocated
and if the process fails only the new storages would get freed.
Song Liu [Thu, 23 Jul 2020 18:06:48 +0000 (11:06 -0700)]
selftests/bpf: Add get_stackid_cannot_attach
This test confirms that BPF program that calls bpf_get_stackid() cannot
attach to perf_event with precise_ip > 0 but not PERF_SAMPLE_CALLCHAIN;
and cannot attach if the perf_event has exclude_callchain_kernel.
selftests/bpf: Test CGROUP_STORAGE map can't be used by multiple progs
The current assumption is that the lifetime of a cgroup storage
is tied to the program's attachment. The storage is created in
cgroup_bpf_attach, and released upon cgroup_bpf_detach and
cgroup_bpf_release.
Because the current semantics is that each attachment gets a
completely independent cgroup storage, and you can have multiple
programs attached to the same (cgroup, attach type) pair, the key
of the CGROUP_STORAGE map, looking up the map with this pair could
yield multiple storages, and that is not permitted. Therefore,
the kernel verifier checks that two programs cannot share the same
CGROUP_STORAGE map, even if they have different expected attach
types, considering that the actual attach type does not always
have to be equal to the expected attach type.
The test creates a CGROUP_STORAGE map and make it shared across
two different programs, one cgroup_skb/egress and one /ingress.
It asserts that the two programs cannot be both loaded, due to
verifier failure from the above reason.
Song Liu [Thu, 23 Jul 2020 18:06:47 +0000 (11:06 -0700)]
selftests/bpf: Add callchain_stackid
This tests new helper function bpf_get_stackid_pe and bpf_get_stack_pe.
These two helpers have different implementation for perf_event with PEB
entries.
selftests/bpf: Add test for CGROUP_STORAGE map on multiple attaches
This test creates a parent cgroup, and a child of that cgroup.
It attaches a cgroup_skb/egress program that simply counts packets,
to a global variable (ARRAY map), and to a CGROUP_STORAGE map.
The program is first attached to the parent cgroup only, then to
parent and child.
The test cases sends a message within the child cgroup, and because
the program is inherited across parent / child cgroups, it will
trigger the egress program for both the parent and child, if they
exist. The program, when looking up a CGROUP_STORAGE map, uses the
cgroup and attach type of the attachment parameters; therefore,
both attaches uses different cgroup storages.
We assert that all packet counts returns what we expects.
Song Liu [Thu, 23 Jul 2020 18:06:46 +0000 (11:06 -0700)]
libbpf: Print hint when PERF_EVENT_IOC_SET_BPF returns -EPROTO
The kernel prevents potential unwinder warnings and crashes by blocking
BPF program with bpf_get_[stack|stackid] on perf_event without
PERF_SAMPLE_CALLCHAIN, or with exclude_callchain_[kernel|user]. Print a
hint message in libbpf to help the user debug such issues.
====================
Bpf iterator has been implemented for task, task_file,
bpf_map, ipv6_route, netlink, tcp and udp so far.
For map elements, there are two ways to traverse all elements from
user space:
1. using BPF_MAP_GET_NEXT_KEY bpf subcommand to get elements
one by one.
2. using BPF_MAP_LOOKUP_BATCH bpf subcommand to get a batch of
elements.
Both these approaches need to copy data from kernel to user space
in order to do inspection.
This patch implements bpf iterator for map elements.
User can have a bpf program in kernel to run with each map element,
do checking, filtering, aggregation, modifying values etc.
without copying data to user space.
Patch #1 and #2 are refactoring. Patch #3 implements readonly/readwrite
buffer support in verifier. Patches #4 - #7 implements map element
support for hash, percpu hash, lru hash lru percpu hash, array,
percpu array and sock local storage maps. Patches #8 - #9 are libbpf
and bpftool support. Patches #10 - #13 are selftests for implemented
map element iterators.
Changelogs:
v3 -> v4:
. fix a kasan failure triggered by a failed bpf_iter link_create,
not just free_link but need cleanup_link. (Alexei)
v2 -> v3:
. rebase on top of latest bpf-next
v1 -> v2:
. support to modify map element values. (Alexei)
. map key/values can be used with helper arguments
for those arguments with ARG_PTR_TO_MEM or
ARG_PTR_TO_INIT_MEM register type. (Alexei)
. remove usused variable. (kernel test robot)
====================
Song Liu [Thu, 23 Jul 2020 18:06:45 +0000 (11:06 -0700)]
bpf: Fail PERF_EVENT_IOC_SET_BPF when bpf_get_[stack|stackid] cannot work
bpf_get_[stack|stackid] on perf_events with precise_ip uses callchain
attached to perf_sample_data. If this callchain is not presented, do not
allow attaching BPF program that calls bpf_get_[stack|stackid] to this
event.
In the error case, -EPROTO is returned so that libbpf can identify this
error and print proper hint message.
Yonghong Song [Thu, 23 Jul 2020 18:41:24 +0000 (11:41 -0700)]
selftests/bpf: Add a test for out of bound rdonly buf access
If the bpf program contains out of bound access w.r.t. a
particular map key/value size, the verification will be
still okay, e.g., it will be accepted by verifier. But
it will be rejected during link_create time. A test
is added here to ensure link_create failure did happen
if out of bound access happened.
$ ./test_progs -n 4
...
#4/23 rdonly-buf-out-of-bound:OK
...
Song Liu [Thu, 23 Jul 2020 18:06:44 +0000 (11:06 -0700)]
bpf: Separate bpf_get_[stack|stackid] for perf events BPF
Calling get_perf_callchain() on perf_events from PEBS entries may cause
unwinder errors. To fix this issue, the callchain is fetched early. Such
perf_events are marked with __PERF_SAMPLE_CALLCHAIN_EARLY.
Similarly, calling bpf_get_[stack|stackid] on perf_events from PEBS may
also cause unwinder errors. To fix this, add separate version of these
two helpers, bpf_get_[stack|stackid]_pe. These two hepers use callchain in
bpf_perf_event_data_kern->data->callchain.
Yonghong Song [Thu, 23 Jul 2020 18:41:19 +0000 (11:41 -0700)]
tools/bpftool: Add bpftool support for bpf map element iterator
The optional parameter "map MAP" can be added to "bpftool iter"
command to create a bpf iterator for map elements. For example,
bpftool iter pin ./prog.o /sys/fs/bpf/p1 map id 333
For map element bpf iterator "map MAP" parameter is required.
Otherwise, bpf link creation will return an error.
Quentin Monnet kindly provided bash-completion implementation
for new "map MAP" option.
Yonghong Song [Thu, 23 Jul 2020 18:41:17 +0000 (11:41 -0700)]
tools/libbpf: Add support for bpf map element iterator
Add map_fd to bpf_iter_attach_opts and flags to
bpf_link_create_opts. Later on, bpftool or selftest
will be able to create a bpf map element iterator
by passing map_fd to the kernel during link
creation time.
Yonghong Song [Thu, 23 Jul 2020 18:41:16 +0000 (11:41 -0700)]
bpf: Implement bpf iterator for sock local storage map
The bpf iterator for bpf sock local storage map
is implemented. User space interacts with sock
local storage map with fd as a key and storage value.
In kernel, passing fd to the bpf program does not
really make sense. In this case, the sock itself is
passed to bpf program.
Yonghong Song [Thu, 23 Jul 2020 18:41:15 +0000 (11:41 -0700)]
bpf: Implement bpf iterator for array maps
The bpf iterators for array and percpu array
are implemented. Similar to hash maps, for percpu
array map, bpf program will receive values
from all cpus.
Yonghong Song [Thu, 23 Jul 2020 18:41:14 +0000 (11:41 -0700)]
bpf: Implement bpf iterator for hash maps
The bpf iterators for hash, percpu hash, lru hash
and lru percpu hash are implemented. During link time,
bpf_iter_reg->check_target() will check map type
and ensure the program access key/value region is
within the map defined key/value size limit.
For percpu hash and lru hash maps, the bpf program
will receive values for all cpus. The map element
bpf iterator infrastructure will prepare value
properly before passing the value pointer to the
bpf program.
This patch set supports readonly map keys and
read/write map values. It does not support deleting
map elements, e.g., from hash tables. If there is
a user case for this, the following mechanism can
be used to support map deletion for hashtab, etc.
- permit a new bpf program return value, e.g., 2,
to let bpf iterator know the map element should
be removed.
- since bucket lock is taken, the map element will be
queued.
- once bucket lock is released after all elements under
this bucket are traversed, all to-be-deleted map
elements can be deleted.
It's mostly a copy paste of commit 6086d29def80 ("bpf: Add bpf_map iterator")
that is use to implement bpf_seq_file opreations to traverse all bpf programs.
v1->v2: Tweak to use build time btf_id
Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Yonghong Song [Thu, 23 Jul 2020 18:41:12 +0000 (11:41 -0700)]
bpf: Implement bpf iterator for map elements
The bpf iterator for map elements are implemented.
The bpf program will receive four parameters:
bpf_iter_meta *meta: the meta data
bpf_map *map: the bpf_map whose elements are traversed
void *key: the key of one element
void *value: the value of the same element
Here, meta and map pointers are always valid, and
key has register type PTR_TO_RDONLY_BUF_OR_NULL and
value has register type PTR_TO_RDWR_BUF_OR_NULL.
The kernel will track the access range of key and value
during verification time. Later, these values will be compared
against the values in the actual map to ensure all accesses
are within range.
A new field iter_seq_info is added to bpf_map_ops which
is used to add map type specific information, i.e., seq_ops,
init/fini seq_file func and seq_file private data size.
Subsequent patches will have actual implementation
for bpf_map_ops->iter_seq_info.
In user space, BPF_ITER_LINK_MAP_FD needs to be
specified in prog attr->link_create.flags, which indicates
that attr->link_create.target_fd is a map_fd.
The reason for such an explicit flag is for possible
future cases where one bpf iterator may allow more than
one possible customization, e.g., pid and cgroup id for
task_file.
Current kernel internal implementation only allows
the target to register at most one required bpf_iter_link_info.
To support the above case, optional bpf_iter_link_info's
are needed, the target can be extended to register such link
infos, and user provided link_info needs to match one of
target supported ones.
Yonghong Song [Wed, 22 Jul 2020 19:51:56 +0000 (12:51 -0700)]
bpf: Fix pos computation for bpf_iter seq_ops->start()
Currently, the pos pointer in bpf iterator map/task/task_file
seq_ops->start() is always incremented.
This is incorrect. It should be increased only if
*pos is 0 (for SEQ_START_TOKEN) since these start()
function actually returns the first real object.
If *pos is not 0, it merely found the object
based on the state in seq->private, and not really
advancing the *pos. This patch fixed this issue
by only incrementing *pos if it is 0.
Note that the old *pos calculation, although not
correct, does not affect correctness of bpf_iter
as bpf_iter seq_file->read() does not support llseek.
This patch also renamed "mid" in bpf_map iterator
seq_file private data to "map_id" for better clarity.
Yonghong Song [Thu, 23 Jul 2020 18:41:11 +0000 (11:41 -0700)]
bpf: Support readonly/readwrite buffers in verifier
Readonly and readwrite buffer register states
are introduced. Totally four states,
PTR_TO_RDONLY_BUF[_OR_NULL] and PTR_TO_RDWR_BUF[_OR_NULL]
are supported. As suggested by their respective
names, PTR_TO_RDONLY_BUF[_OR_NULL] are for
readonly buffers and PTR_TO_RDWR_BUF[_OR_NULL]
for read/write buffers.
These new register states will be used
by later bpf map element iterator.
New register states share some similarity to
PTR_TO_TP_BUFFER as it will calculate accessed buffer
size during verification time. The accessed buffer
size will be later compared to other metrics during
later attach/link_create time.
Similar to reg_state PTR_TO_BTF_ID_OR_NULL in bpf
iterator programs, PTR_TO_RDONLY_BUF_OR_NULL or
PTR_TO_RDWR_BUF_OR_NULL reg_types can be set at
prog->aux->bpf_ctx_arg_aux, and bpf verifier will
retrieve the values during btf_ctx_access().
Later bpf map element iterator implementation
will show how such information will be assigned
during target registeration time.
The verifier is also enhanced such that PTR_TO_RDONLY_BUF
can be passed to ARG_PTR_TO_MEM[_OR_NULL] helper argument, and
PTR_TO_RDWR_BUF can be passed to ARG_PTR_TO_MEM[_OR_NULL] or
ARG_PTR_TO_UNINIT_MEM.
Jakub Sitnicki [Wed, 22 Jul 2020 16:17:20 +0000 (18:17 +0200)]
selftests/bpf: Test BPF socket lookup and reuseport with connections
Cover the case when BPF socket lookup returns a socket that belongs to a
reuseport group, and the reuseport group contains connected UDP sockets.
Ensure that the presence of connected UDP sockets in reuseport group does
not affect the socket lookup result. Socket selected by reuseport should
always be used as result in such case.
Yonghong Song [Thu, 23 Jul 2020 18:41:10 +0000 (11:41 -0700)]
bpf: Refactor to provide aux info to bpf_iter_init_seq_priv_t
This patch refactored target bpf_iter_init_seq_priv_t callback
function to accept additional information. This will be needed
in later patches for map element targets since a particular
map should be passed to traverse elements for that particular
map. In the future, other information may be passed to target
as well, e.g., pid, cgroup id, etc. to customize the iterator.
Yonghong Song [Thu, 23 Jul 2020 18:41:09 +0000 (11:41 -0700)]
bpf: Refactor bpf_iter_reg to have separate seq_info member
There is no functionality change for this patch.
Struct bpf_iter_reg is used to register a bpf_iter target,
which includes information for both prog_load, link_create
and seq_file creation.
This patch puts fields related seq_file creation into
a different structure. This will be useful for map
elements iterator where one iterator covers different
map types and different map types may have different
seq_ops, init/fini private_data function and
private_data size.
Jakub Sitnicki [Wed, 22 Jul 2020 16:17:19 +0000 (18:17 +0200)]
udp: Don't discard reuseport selection when group has connections
When BPF socket lookup prog selects a socket that belongs to a reuseport
group, and the reuseport group has connected sockets in it, the socket
selected by reuseport will be discarded, and socket returned by BPF socket
lookup will be used instead.
Modify this behavior so that the socket selected by reuseport running after
BPF socket lookup always gets used. Ignore the fact that the reuseport
group might have connections because it is only relevant when scoring
sockets during regular hashtable-based lookup.
Fixes: 72f7e9440e9b ("udp: Run SK_LOOKUP BPF program on socket lookup") Fixes: 6d4201b1386b ("udp6: Run SK_LOOKUP BPF program on socket lookup") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Link: https://lore.kernel.org/bpf/20200722161720.940831-2-jakub@cloudflare.com
The UDP reuseport conflict was a little bit tricky.
The net-next code, via bpf-next, extracted the reuseport handling
into a helper so that the BPF sk lookup code could invoke it.
At the same time, the logic for reuseport handling of unconnected
sockets changed via commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace
which changed the logic to carry on the reuseport result into the
rest of the lookup loop if we do not return immediately.
This requires moving the reuseport_has_conns() logic into the callers.
While we are here, get rid of inline directives as they do not belong
in foo.c files.
The other changes were cases of more straightforward overlapping
modifications.
Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'riscv-for-linus-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux into master
Pull RISC-V fixes from Palmer Dabbelt:
"A few more fixes this week:
- A fix to avoid using SBI calls during kasan initialization, as the
SBI calls themselves have not been probed yet.
- Three fixes related to systems with multiple memory regions"
* tag 'riscv-for-linus-5.8-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Parse all memory blocks to remove unusable memory
RISC-V: Do not rely on initrd_start/end computed during early dt parsing
RISC-V: Set maximum number of mapped pages correctly
riscv: kasan: use local_tlb_flush_all() to avoid uninitialized __sbi_rfence
Merge tag 'x86-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull x86 fixes from Ingo Molnar:
"Misc fixes:
- Fix a section end page alignment assumption that was causing
crashes
- Fix ORC unwinding on freshly forked tasks which haven't executed
yet and which have empty user task stacks
- Fix the debug.exception-trace=1 sysctl dumping of user stacks,
which was broken by recent maccess changes"
* tag 'x86-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/dumpstack: Dump user space code correctly again
x86/stacktrace: Fix reliable check for empty user task stacks
x86/unwind/orc: Fix ORC for newly forked tasks
x86, vmlinux.lds: Page-align end of ..page_aligned sections
Merge tag 'perf-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull uprobe fix from Ingo Molnar:
"Fix an interaction/regression between uprobes based shared library
tracing & GDB"
* tag 'perf-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
uprobes: Change handle_swbp() to send SIGTRAP with si_code=SI_KERNEL, to fix GDB regression
Merge tag 'timers-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull timer fix from Ingo Molnar:
"Fix a suspend/resume regression (crash) on TI AM3/AM4 SoC's"
* tag 'timers-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
clocksource/drivers/timer-ti-dm: Fix suspend and resume for am3 and am4
Merge tag 'sched-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull scheduler fixes from Ingo Molnar:
"Fix a race introduced by the recent loadavg race fix, plus add a debug
check for a hard to debug case of bogus wakeup function flags"
* tag 'sched-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Warn if garbage is passed to default_wake_function()
sched: Fix race against ptrace_freeze_trace()
Merge tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip into master
Pull EFI fixes from Ingo Molnar:
"Various EFI fixes:
- Fix the layering violation in the use of the EFI runtime services
availability mask in users of the 'efivars' abstraction
- Revert build fix for GCC v4.8 which is no longer supported
- Clean up some x86 EFI stub details, some of which are borderline
bugs that copy around garbage into padding fields - let's fix these
out of caution.
- Fix build issues while working on RISC-V support
- Avoid --whole-archive when linking the stub on arm64"
* tag 'efi-urgent-2020-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
efi: Revert "efi/x86: Fix build with gcc 4"
efi/efivars: Expose RT service availability via efivars abstraction
efi/libstub: Move the function prototypes to header file
efi/libstub: Fix gcc error around __umoddi3 for 32 bit builds
efi/libstub/arm64: link stub lib.a conditionally
efi/x86: Only copy upto the end of setup_header
efi/x86: Remove unused variables
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net into master
Pull networking fixes from David Miller:
1) Fix RCU locaking in iwlwifi, from Johannes Berg.
2) mt76 can access uninitialized NAPI struct, from Felix Fietkau.
3) Fix race in updating pause settings in bnxt_en, from Vasundhara
Volam.
4) Propagate error return properly during unbind failures in ax88172a,
from George Kennedy.
5) Fix memleak in adf7242_probe, from Liu Jian.
6) smc_drv_probe() can leak, from Wang Hai.
7) Don't muck with the carrier state if register_netdevice() fails in
the bonding driver, from Taehee Yoo.
8) Fix memleak in dpaa_eth_probe, from Liu Jian.
9) Need to check skb_put_padto() return value in hsr_fill_tag(), from
Murali Karicheri.
10) Don't lose ionic RSS hash settings across FW update, from Shannon
Nelson.
11) Fix clobbered SKB control block in act_ct, from Wen Xu.
12) Missing newlink in "tx_timeout" sysfs output, from Xiongfeng Wang.
13) IS_UDPLITE cleanup a long time ago, incorrectly handled
transformations involving UDPLITE_RECV_CC. From Miaohe Lin.
14) Unbalanced locking in netdevsim, from Taehee Yoo.
15) Suppress false-positive error messages in qed driver, from Alexander
Lobakin.
16) Out of bounds read in ax25_connect and ax25_sendmsg, from Peilin Ye.
17) Missing SKB release in cxgb4's uld_send(), from Navid Emamdoost.
18) Uninitialized value in geneve_changelink(), from Cong Wang.
19) Fix deadlock in xen-netfront, from Andera Righi.
19) flush_backlog() frees skbs with IRQs disabled, so should use
dev_kfree_skb_irq() instead of kfree_skb(). From Subash Abhinov
Kasiviswanathan.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
drivers/net/wan: lapb: Corrected the usage of skb_cow
dev: Defer free of skbs in flush_backlog
qrtr: orphan socket in qrtr_release()
xen-netfront: fix potential deadlock in xennet_remove()
flow_offload: Move rhashtable inclusion to the source file
geneve: fix an uninitialized value in geneve_changelink()
bonding: check return value of register_netdevice() in bond_newlink()
tcp: allow at most one TLP probe per flight
AX.25: Prevent integer overflows in connect and sendmsg
cxgb4: add missing release on skb in uld_send()
net: atlantic: fix PTP on AQC10X
AX.25: Prevent out-of-bounds read in ax25_sendmsg()
sctp: shrink stream outq when fails to do addstream reconf
sctp: shrink stream outq only when new outcnt < old outcnt
AX.25: Fix out-of-bounds read in ax25_connect()
enetc: Remove the mdio bus on PF probe bailout
net: ethernet: ti: add NETIF_F_HW_TC hw feature flag for taprio offload
net: ethernet: ave: Fix error returns in ave_init
drivers/net/wan/x25_asy: Fix to make it work
ipvs: fix the connection sync failed in some cases
...
riscv: Parse all memory blocks to remove unusable memory
Currently, maximum physical memory allowed is equal to -PAGE_OFFSET.
That's why we remove any memory blocks spanning beyond that size. However,
it is done only for memblock containing linux kernel which will not work
if there are multiple memblocks.
Process all memory blocks to figure out how much memory needs to be removed
and remove at the end instead of updating the memblock list in place.
RISC-V: Do not rely on initrd_start/end computed during early dt parsing
Currently, initrd_start/end are computed during early_init_dt_scan
but used during arch_setup. We will get the following panic if initrd is used
and CONFIG_DEBUG_VIRTUAL is turned on.
To avoid the error, initrd_start/end can be computed from phys_initrd_start/size
in setup itself. It also improves the initrd placement by aligning the start
and size with the page size.
Fixes: 76d2a0493a17 ("RISC-V: Init and Halt Code") Signed-off-by: Atish Patra <atish.patra@wdc.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Xie He [Fri, 24 Jul 2020 16:33:47 +0000 (09:33 -0700)]
drivers/net/wan: lapb: Corrected the usage of skb_cow
This patch fixed 2 issues with the usage of skb_cow in LAPB drivers
"lapbether" and "hdlc_x25":
1) After skb_cow fails, kfree_skb should be called to drop a reference
to the skb. But in both drivers, kfree_skb is not called.
2) skb_cow should be called before skb_push so that is can ensure the
safety of skb_push. But in "lapbether", it is incorrectly called after
skb_push.
More details about these 2 issues:
1) The behavior of calling kfree_skb on failure is also the behavior of
netif_rx, which is called by this function with "return netif_rx(skb);".
So this function should follow this behavior, too.
2) In "lapbether", skb_cow is called after skb_push. This results in 2
logical issues:
a) skb_push is not protected by skb_cow;
b) An extra headroom of 1 byte is ensured after skb_push. This extra
headroom has no use in this function. It also has no use in the
upper-layer function that this function passes the skb to
(x25_lapb_receive_frame in net/x25/x25_dev.c).
So logically skb_cow should instead be called before skb_push.
Cc: Eric Dumazet <edumazet@google.com> Cc: Martin Schiller <ms@dev.tdt.de> Signed-off-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 25 Jul 2020 03:03:28 +0000 (20:03 -0700)]
Merge branch 'net-dsa-mv88e6xxx-port-mtu-support'
Chris Packham says:
====================
net: dsa: mv88e6xxx: port mtu support
This series connects up the mv88e6xxx switches to the dsa infrastructure for
configuring the port MTU. The first patch is also a bug fix which might be a
candiatate for stable.
I've rebased this series on top of net-next/master to pick up Andrew's change
for the gigabit switches. Patch 1 and 2 are unchanged (aside from adding
Andrew's Reviewed-by). Patch 3 is reworked to make use of the existing mtu
support.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Packham [Thu, 23 Jul 2020 23:21:22 +0000 (11:21 +1200)]
net: dsa: mv88e6xxx: Use chip-wide max frame size for MTU
Some of the chips in the mv88e6xxx family don't support jumbo
configuration per port. But they do have a chip-wide max frame size that
can be used. Use this to approximate the behaviour of configuring a port
based MTU.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Packham [Thu, 23 Jul 2020 23:21:21 +0000 (11:21 +1200)]
net: dsa: mv88e6xxx: Support jumbo configuration on 6190/6190X
The MV88E6190 and MV88E6190X both support per port jumbo configuration
just like the other GE switches. Install the appropriate ops.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Packham [Thu, 23 Jul 2020 23:21:20 +0000 (11:21 +1200)]
net: dsa: mv88e6xxx: MV88E6097 does not support jumbo configuration
The MV88E6097 chip does not support configuring jumbo frames. Prior to
commit 5f4366660d65 only the 6352, 6351, 6165 and 6320 chips configured
jumbo mode. The refactor accidentally added the function for the 6097.
Remove the erroneous function pointer assignment.
Fixes: 5f4366660d65 ("net: dsa: mv88e6xxx: Refactor setting of jumbo frames") Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
IRQs are disabled when freeing skbs in input queue.
Use the IRQ safe variant to free skbs here.
Fixes: 145dd5f9c88f ("net: flush the softnet backlog in process context") Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
RISC-V: Set maximum number of mapped pages correctly
Currently, maximum number of mapper pages are set to the pfn calculated
from the memblock size of the memblock containing kernel. This will work
until that memblock spans the entire memory. However, it will be set to
a wrong value if there are multiple memblocks defined in kernel
(e.g. with efi runtime services).
Set the the maximum value to the pfn calculated from dram size.
Merge tag 'pci-v5.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci into master
Pull PCI fixes from Bjorn Helgaas:
- Reject invalid IRQ 0 command line argument for virtio_mmio because
IRQ 0 now generates warnings (Bjorn Helgaas)
- Revert "PCI/PM: Assume ports without DLL Link Active train links in
100 ms", which broke nouveau (Bjorn Helgaas)
* tag 'pci-v5.8-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
Revert "PCI/PM: Assume ports without DLL Link Active train links in 100 ms"
virtio-mmio: Reject invalid IRQ 0 command line argument
Cong Wang [Fri, 24 Jul 2020 16:45:51 +0000 (09:45 -0700)]
qrtr: orphan socket in qrtr_release()
We have to detach sock from socket in qrtr_release(),
otherwise skb->sk may still reference to this socket
when the skb is released in tun->queue, particularly
sk->sk_wq still points to &sock->wq, which leads to
a UAF.
Reported-and-tested-by: syzbot+6720d64f31c081c2f708@syzkaller.appspotmail.com Fixes: 28fb4e59a47d ("net: qrtr: Expose tunneling endpoint to user space") Cc: Bjorn Andersson <bjorn.andersson@linaro.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Hai [Fri, 24 Jul 2020 13:46:30 +0000 (21:46 +0800)]
net: hix5hd2_gmac: Remove unneeded cast from memory allocation
Remove casting the values returned by memory allocation function.
Coccinelle emits WARNING:
./drivers/net/ethernet/hisilicon/hix5hd2_gmac.c:1027:9-23: WARNING:
casting value returned by memory allocation function to (struct sg_desc *) is useless.
This issue was detected by using the Coccinelle software.
Signed-off-by: Wang Hai <wanghai38@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
l2tp hasn't been kept up to date with the static analysis checks offered
by checkpatch.pl.
This patchset builds on the series: "l2tp: cleanup checkpatch.pl
warnings" and "l2tp: further checkpatch.pl cleanups" to resolve some of
the remaining checkpatch warnings in l2tp.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Parkin [Fri, 24 Jul 2020 15:31:57 +0000 (16:31 +0100)]
l2tp: WARN_ON rather than BUG_ON in l2tp_session_free
l2tp_session_free called BUG_ON if the tunnel magic feather value wasn't
correct. The intent of this was to catch lifetime bugs; for example
early tunnel free due to incorrect use of reference counts.
Since the tunnel magic feather being wrong indicates either early free
or structure corruption, we can avoid doing more damage by simply
leaving the tunnel structure alone. If the tunnel refcount isn't
dropped when it should be, the tunnel instance will remain in the
kernel, resulting in the tunnel structure and socket leaking.
Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Parkin [Fri, 24 Jul 2020 15:31:56 +0000 (16:31 +0100)]
l2tp: remove BUG_ON refcount value in l2tp_session_free
l2tp_session_free is only called by l2tp_session_dec_refcount when the
reference count reaches zero, so it's of limited value to validate the
reference count value in l2tp_session_free itself.
Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Parkin [Fri, 24 Jul 2020 15:31:55 +0000 (16:31 +0100)]
l2tp: WARN_ON rather than BUG_ON in l2tp_session_queue_purge
l2tp_session_queue_purge is used during session shutdown to drop any
skbs queued for reordering purposes according to L2TP dataplane rules.
The BUG_ON in this function checks the session magic feather in an
attempt to catch lifetime bugs.
Rather than crashing the kernel with a BUG_ON, we can simply WARN_ON and
refuse to do anything more -- in the worst case this could result in a
leak. However this is highly unlikely given that the session purge only
occurs from codepaths which have obtained the session by means of a lookup
via. the parent tunnel and which check the session "dead" flag to
protect against shutdown races.
While we're here, have l2tp_session_queue_purge return void rather than
an integer, since neither of the callsites checked the return value.
Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>