Yonghong Song [Sun, 20 Mar 2022 03:20:09 +0000 (20:20 -0700)]
bpftool: Fix a bug in subskeleton code generation
Compiled with clang by adding LLVM=1 both kernel and selftests/bpf
build, I hit the following compilation error:
In file included from /.../tools/testing/selftests/bpf/prog_tests/subskeleton.c:6:
./test_subskeleton_lib.subskel.h:168:6: error: variable 'err' is used uninitialized whenever
'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (!s->progs)
^~~~~~~~~
./test_subskeleton_lib.subskel.h:181:11: note: uninitialized use occurs here
errno = -err;
^~~
./test_subskeleton_lib.subskel.h:168:2: note: remove the 'if' if its condition is always false
if (!s->progs)
^~~~~~~~~~~~~~
The compilation error is triggered by the following code
...
int err;
err:
test_subskeleton_lib__destroy(obj);
errno = -err;
...
in test_subskeleton_lib__open(). The 'err' is not initialized, yet it
is used in 'errno = -err' later.
The fix is to remove 'errno = -err' since errno has been set properly
in all incoming branches.
Song Liu [Mon, 21 Mar 2022 18:00:08 +0000 (11:00 -0700)]
bpf: Fix bpf_prog_pack for multi-node setup
module_alloc requires num_online_nodes * PMD_SIZE to allocate huge pages.
bpf_prog_pack uses pack of size num_online_nodes * PMD_SIZE.
OTOH, module_alloc returns addresses that are PMD_SIZE aligned (instead of
num_online_nodes * PMD_SIZE aligned). Therefore, PMD_MASK should be used
to calculate pack_ptr in bpf_prog_pack_free().
Fixes: 1ca2b2ac2c72 ("bpf: Select proper size for bpf_prog_pack") Reported-by: syzbot+c946805b5ce6ab87df0b@syzkaller.appspotmail.com Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220321180009.1944482-2-song@kernel.org
Jiri Olsa [Mon, 21 Mar 2022 07:01:13 +0000 (08:01 +0100)]
bpf: Fix kprobe_multi return probe backtrace
Andrii reported that backtraces from kprobe_multi program attached
as return probes are not complete and showing just initial entry [1].
It's caused by changing registers to have original function ip address
as instruction pointer even for return probe, which will screw backtrace
from return probe.
This change keeps registers intact and store original entry ip and
link address on the stack in bpf_kprobe_multi_run_ctx struct, where
bpf_get_func_ip and bpf_get_attach_cookie helpers for kprobe_multi
programs can find it.
Hangbin Liu [Mon, 21 Mar 2022 02:41:49 +0000 (10:41 +0800)]
selftests/bpf/test_lirc_mode2.sh: Exit with proper code
When test_lirc_mode2_user exec failed, the test report failed but still
exit with 0. Fix it by exiting with an error code.
Another issue is for the LIRCDEV checking. With bash -n, we need to quote
the variable, or it will always be true. So if test_lirc_mode2_user was
not run, just exit with skip code.
bpf: Check for NULL return from bpf_get_btf_vmlinux
When CONFIG_DEBUG_INFO_BTF is disabled, bpf_get_btf_vmlinux can return a
NULL pointer. Check for it in btf_get_module_btf to prevent a NULL pointer
dereference.
While kernel test robot only complained about this specific case, let's
also check for NULL in other call sites of bpf_get_btf_vmlinux.
Namhyung Kim [Mon, 14 Mar 2022 18:20:42 +0000 (11:20 -0700)]
selftests/bpf: Test skipping stacktrace
Add a test case for stacktrace with skip > 0 using a small sized
buffer. It didn't support skipping entries greater than or equal to
the size of buffer and filled the skipped part with 0.
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only 'num_elem - skip' frames.
This is because it sets init_nr in the perf_callchain_entry to the end
of the buffer to save num_elem entries only. I believe it was because
the perf callchain code unwound the stack frames until it reached the
global max size (sysctl_perf_event_max_stack).
However it now has perf_callchain_entry_ctx.max_stack to limit the
iteration locally. This simplifies the code to handle init_nr in the
BPF callstack entries and removes the confusion with the perf_event's
__PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0.
Also change the comment on bpf_get_stack() in the header file to be
more explicit what the return value means.
Song Liu [Fri, 11 Mar 2022 20:11:35 +0000 (12:11 -0800)]
bpf: Select proper size for bpf_prog_pack
Using HPAGE_PMD_SIZE as the size for bpf_prog_pack is not ideal in some
cases. Specifically, for NUMA systems, __vmalloc_node_range requires
PMD_SIZE * num_online_nodes() to allocate huge pages. Also, if the system
does not support huge pages (i.e., with cmdline option nohugevmalloc), it
is better to use PAGE_SIZE packs.
Add logic to select proper size for bpf_prog_pack. This solution is not
ideal, as it makes assumption about the behavior of module_alloc and
__vmalloc_node_range. However, it appears to be the easiest solution as
it doesn't require changes in module_alloc and vmalloc code.
Merge branch 'Make 2-byte access to bpf_sk_lookup->remote_port endian-agnostic'
Jakub Sitnicki says:
====================
This patch set is a result of a discussion we had around the RFC patchset from
Ilya [1]. The fix for the narrow loads from the RFC series is still relevant,
but this series does not depend on it. Nor is it required to unbreak sk_lookup
tests on BE, if this series gets applied.
To summarize the takeaways from [1]:
1) we want to make 2-byte load from ctx->remote_port portable across LE and BE,
2) we keep the 4-byte load from ctx->remote_port as it is today - result varies
on endianess of the platform.
Jakub Sitnicki [Sat, 19 Mar 2022 18:33:56 +0000 (19:33 +0100)]
selftests/bpf: Fix test for 4-byte load from remote_port on big-endian
The context access converter rewrites the 4-byte load from
bpf_sk_lookup->remote_port to a 2-byte load from bpf_sk_lookup_kern
structure.
It means that we cannot treat the destination register contents as a 32-bit
value, or the code will not be portable across big- and little-endian
architectures.
This is exactly the same case as with 4-byte loads from bpf_sock->dst_port
so follow the approach outlined in [1] and treat the register contents as a
16-bit value in the test.
Fixes: 306ecb352f97 ("selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220319183356.233666-4-jakub@cloudflare.com
Above addresses map to ->remote_ip6 field, which precedes ->remote_port,
and are populated during the bpf_sk_lookup IPv6 tests.
Unsurprisingly, on s390x we observe:
#136/38 sk_lookup/narrow access to ctx v4:OK
#136/39 sk_lookup/narrow access to ctx v6:FAIL
Fix it by removing the checks for 1-byte loads from offsets outside of the
->remote_port field.
Fixes: 9f98f541ae0b ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide") Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220319183356.233666-3-jakub@cloudflare.com
Jakub Sitnicki [Sat, 19 Mar 2022 18:33:54 +0000 (19:33 +0100)]
bpf: Treat bpf_sk_lookup remote_port as a 2-byte field
In commit 9f98f541ae0b ("bpf: Make remote_port field in struct
bpf_sk_lookup 16-bit wide") the remote_port field has been split up and
re-declared from u32 to be16.
However, the accompanying changes to the context access converter have not
been well thought through when it comes big-endian platforms.
Today 2-byte wide loads from offsetof(struct bpf_sk_lookup, remote_port)
are handled as narrow loads from a 4-byte wide field.
This by itself is not enough to create a problem, but when we combine
1. 32-bit wide access to ->remote_port backed by a 16-wide wide load, with
2. inherent difference between litte- and big-endian in how narrow loads
need have to be handled (see bpf_ctx_narrow_access_offset),
we get inconsistent results for a 2-byte loads from &ctx->remote_port on LE
and BE architectures. This in turn makes BPF C code for the common case of
2-byte load from ctx->remote_port not portable.
To rectify it, inform the context access converter that remote_port is
2-byte wide field, and only 1-byte loads need to be treated as narrow
loads.
At the same time, we special-case the 4-byte load from &ctx->remote_port to
continue handling it the same way as do today, in order to keep the
existing BPF programs working.
Fixes: 9f98f541ae0b ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220319183356.233666-2-jakub@cloudflare.com
Merge branch 'Enable non-atomic allocations in local storage'
Joanne Koong says:
====================
From: Joanne Koong <joannelkoong@gmail.com>
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patchset, sleepable programs can allocate memory in local
storage using GFP_KERNEL, while non-sleepable programs always default to
GFP_ATOMIC.
v3 <- v2:
* Add extra case to local_storage.c selftest to test associating multiple
elements with the local storage, which triggers a GFP_KERNEL allocation in
local_storage_update().
* Cast gfp_t to __s32 in verifier to fix the sparse warnings
v2 <- v1:
* Allocate the memory before/after the raw_spin_lock_irqsave, depending
on the gfp flags
* Rename mem_flags to gfp_flags
* Reword the comment "*mem_flags* is set by the bpf verifier" to
"*gfp_flags* is a hidden argument provided by the verifier"
* Add a sentence to the commit message about existing local storage
selftests covering both the GFP_ATOMIC and GFP_KERNEL paths in
bpf_local_storage_update.
====================
Joanne Koong [Fri, 18 Mar 2022 04:55:52 +0000 (21:55 -0700)]
bpf: Enable non-atomic allocations in local storage
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patch, the verifier detects whether the program is sleepable,
and passes the corresponding GFP_KERNEL or GFP_ATOMIC flag as a
5th argument to bpf_task/sk/inode_storage_get. This flag will propagate
down to the local storage functions that allocate memory.
Please note that bpf_task/sk/inode_storage_update_elem functions are
invoked by userspace applications through syscalls. Preemption is
disabled before bpf_task/sk/inode_storage_update_elem is called, which
means they will always have to allocate memory atomically.
Align it with helpers like bpf_find_btf_id, so all functions returning
BTF in out parameter follow the same rule of raising reference
consistently, regardless of module or vmlinux BTF.
Adjust existing callers to handle the change accordinly.
bpf: Factor out fd returning from bpf_btf_find_by_name_kind
In next few patches, we need a helper that searches all kernel BTFs
(vmlinux and module BTFs), and finds the type denoted by 'name' and
'kind'. Turns out bpf_btf_find_by_name_kind already does the same thing,
but it instead returns a BTF ID and optionally fd (if module BTF). This
is used for relocating ksyms in BPF loader code (bpftool gen skel -L).
We extract the core code out into a new helper bpf_find_btf_id, which
returns the BTF ID in the return value, and BTF pointer in an out
parameter. The reference for the returned BTF pointer is always raised,
hence user must either transfer it (e.g. to a fd), or release it after
use.
Daniel Borkmann [Fri, 18 Mar 2022 14:46:59 +0000 (15:46 +0100)]
Merge branch 'bpf-fix-sock-field-tests'
Jakub Sitnicki says:
====================
I think we have reached a consensus [1] on how the test for the 4-byte load from
bpf_sock->dst_port and bpf_sk_lookup->remote_port should look, so here goes v3.
I will submit a separate set of patches for bpf_sk_lookup->remote_port tests.
This series has been tested on x86_64 and s390 on top of recent bpf-next - 1e3b4e6dd82c ("selftests/bpf: Test subprog jit when toggle bpf_jit_harden
repeatedly").
v2 -> v3:
- Split what was previously patch 2 which was doing two things
- Use BPF_TCP_* constants (Martin)
- Treat the result of 4-byte load from dst_port as a 16-bit value (Martin)
- Typo fixup and some rewording in patch 4 description
v1 -> v2:
- Limit read_sk_dst_port only to client traffic (patch 2)
- Make read_sk_dst_port pass on litte- and big-endian (patch 3)
Jakub Sitnicki [Thu, 17 Mar 2022 11:39:20 +0000 (12:39 +0100)]
selftests/bpf: Fix test for 4-byte load from dst_port on big-endian
The check for 4-byte load from dst_port offset into bpf_sock is failing on
big-endian architecture - s390. The bpf access converter rewrites the
4-byte load to a 2-byte load from sock_common at skc_dport offset, as shown
below.
Jakub Sitnicki [Thu, 17 Mar 2022 11:39:18 +0000 (12:39 +0100)]
selftests/bpf: Check dst_port only on the client socket
cgroup_skb/egress programs which sock_fields test installs process packets
flying in both directions, from the client to the server, and in reverse
direction.
Recently added dst_port check relies on the fact that destination
port (remote peer port) of the socket which sends the packet is known ahead
of time. This holds true only for the client socket, which connects to the
known server port.
Filter out any traffic that is not egressing from the client socket in the
BPF program that tests reading the dst_port.
Fixes: eea83c039b64 ("selftests/bpf: Extend verifier and bpf_sock tests for dst_port loads") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220317113920.1068535-3-jakub@cloudflare.com
Jakub Sitnicki [Thu, 17 Mar 2022 11:39:17 +0000 (12:39 +0100)]
selftests/bpf: Fix error reporting from sock_fields programs
The helper macro that records an error in BPF programs that exercise sock
fields access has been inadvertently broken by adaptation work that
happened in commit 7333fc0592a2 ("bpf: selftest: Adapt sock_fields test to
use skel and global variables").
BPF_NOEXIST flag cannot be used to update BPF_MAP_TYPE_ARRAY. The operation
always fails with -EEXIST, which in turn means the error never gets
recorded, and the checks for errors always pass.
Revert the change in update flags.
Fixes: 7333fc0592a2 ("bpf: selftest: Adapt sock_fields test to use skel and global variables") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220317113920.1068535-2-jakub@cloudflare.com
Andrii Nakryiko [Fri, 18 Mar 2022 06:11:16 +0000 (23:11 -0700)]
Merge branch 'Subskeleton support for BPF librariesThread-Topic: [PATCH bpf-next v4 0/5'
Delyan Kratunov says:
====================
In the quest for ever more modularity, a new need has arisen - the ability to
access data associated with a BPF library from a corresponding userspace library.
The catch is that we don't want the userspace library to know about the structure of the
final BPF object that the BPF library is linked into.
In pursuit of this modularity, this patch series introduces *subskeletons.*
Subskeletons are similar in use and design to skeletons with a couple of differences:
1. The generated storage types do not rely on contiguous storage for the library's
variables because they may be interspersed randomly throughout the final BPF object's sections.
2. Subskeletons do not own objects and instead require a loaded bpf_object* to
be passed at runtime in order to be initialized. By extension, symbols are resolved at
runtime by parsing the final object's BTF.
3. Subskeletons allow access to all global variables, programs, and custom maps. They also expose
the internal maps *of the final object*. This allows bpf_var_skeleton objects to contain a bpf_map**
instead of a section name.
Changes since v3:
- Re-add key/value type lookup for legacy user maps (fixing btf test)
- Minor cleanups (missed sanitize_identifier call, error messages, formatting)
Changes since v2:
- Reuse SEC_NAME strict mode flag
- Init bpf_map->btf_value_type_id on open for internal maps *and* user BTF maps
- Test custom section names (.data.foo) and overlapping kconfig externs between the final object and the library
- Minor review comments in gen.c & libbpf.c
Changes since v1:
- Introduced new strict mode knob for single-routine-in-.text compatibility behavior, which
disproportionately affects library objects. bpftool works in 1.0 mode so subskeleton generation
doesn't have to worry about this now.
- Made bpf_map_btf_value_type_id available earlier and used it wherever applicable.
- Refactoring in bpftool gen.c per review comments.
- Subskels now use typeof() for array and func proto globals to avoid the need for runtime split btf.
- Expanded the subskeleton test to include arrays, custom maps, extern maps, weak symbols, and kconfigs.
- selftests/bpf/Makefile now generates a subskel.h for every skel.h it would make.
For reference, here is a shortened subskeleton header:
Delyan Kratunov [Wed, 16 Mar 2022 23:37:28 +0000 (23:37 +0000)]
bpftool: Add support for subskeletons
Subskeletons are headers which require an already loaded program to
operate.
For example, when a BPF library is linked into a larger BPF object file,
the library userspace needs a way to access its own global variables
without requiring knowledge about the larger program at build time.
As a result, subskeletons require a loaded bpf_object to open().
Further, they find their own symbols in the larger program by
walking BTF type data at run time.
At this time, programs, maps, and globals are supported through
non-owning pointers.
Delyan Kratunov [Wed, 16 Mar 2022 23:37:33 +0000 (23:37 +0000)]
libbpf: Add subskeleton scaffolding
In symmetry with bpf_object__open_skeleton(),
bpf_object__open_subskeleton() performs the actual walking and linking
of maps, progs, and globals described by bpf_*_skeleton objects.
Delyan Kratunov [Wed, 16 Mar 2022 23:37:30 +0000 (23:37 +0000)]
libbpf: Init btf_{key,value}_type_id on internal map open
For internal and user maps, look up the key and value btf
types on open() and not load(), so that `bpf_map_btf_value_type_id`
is usable in `bpftool gen`.
Delyan Kratunov [Wed, 16 Mar 2022 23:37:24 +0000 (23:37 +0000)]
libbpf: .text routines are subprograms in strict mode
Currently, libbpf considers a single routine in .text to be a program. This
is particularly confusing when it comes to library objects - a single routine
meant to be used as an extern will instead be considered a bpf_program.
This patch hides this compatibility behavior behind the pre-existing
SEC_NAME strict mode flag.
hi,
this patchset adds new link type BPF_TRACE_KPROBE_MULTI that attaches
kprobe program through fprobe API [1] instroduced by Masami.
The fprobe API allows to attach probe on multiple functions at once very
fast, because it works on top of ftrace. On the other hand this limits
the probe point to the function entry or return.
With bpftrace support I see following attach speed:
1.4960 +- 0.0285 seconds time elapsed ( +- 1.91% )
v3 changes:
- based on latest fprobe post from Masami [2]
- add acks
- add extra comment to kprobe_multi_link_handler wrt entry ip setup [Masami]
- keep swap_words_64 static and swap values directly in
bpf_kprobe_multi_cookie_swap [Andrii]
- rearrange locking/migrate setup in kprobe_multi_link_prog_run [Andrii]
- move uapi fields [Andrii]
- add bpf_program__attach_kprobe_multi_opts function [Andrii]
- many small test changes [Andrii]
- added tests for bpf_program__attach_kprobe_multi_opts
- make kallsyms_lookup_name check for empty string [Andrii]
v2 changes:
- based on latest fprobe changes [1]
- renaming the uapi interface to kprobe multi
- adding support for sort_r to pass user pointer for swap functions
and using that in cookie support to keep just single functions array
- moving new link to kernel/trace/bpf_trace.c file
- using single fprobe callback function for entry and exit
- using kvzalloc, libbpf_ensure_mem functions
- adding new k[ret]probe.multi sections instead of using current kprobe
- used glob_match from test_progs.c, added '?' matching
- move bpf_get_func_ip verifier inline change to seprate change
- couple of other minor fixes
Also available at:
https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
bpf/kprobe_multi
Jiri Olsa [Wed, 16 Mar 2022 12:24:16 +0000 (13:24 +0100)]
selftests/bpf: Add kprobe_multi attach test
Adding kprobe_multi attach test that uses new fprobe interface to
attach kprobe program to multiple functions.
The test is attaching programs to bpf_fentry_test* functions and
uses single trampoline program bpf_prog_test_run to trigger
bpf_fentry_test* functions.
User can specify functions to attach with 'pattern' argument that
allows wildcards (*?' supported) or provide symbols or addresses
directly through opts argument. These 3 options are mutually
exclusive.
When using symbols or addresses, user can also provide cookie value
for each symbol/address that can be retrieved later in bpf program
with bpf_get_attach_cookie helper.
Jiri Olsa [Wed, 16 Mar 2022 12:24:12 +0000 (13:24 +0100)]
bpf: Add cookie support to programs attached with kprobe multi link
Adding support to call bpf_get_attach_cookie helper from
kprobe programs attached with kprobe multi link.
The cookie is provided by array of u64 values, where each
value is paired with provided function address or symbol
with the same array index.
When cookie array is provided it's sorted together with
addresses (check bpf_kprobe_multi_cookie_swap). This way
we can find cookie based on the address in
bpf_get_attach_cookie helper.
Merge branch 'fprobe: Introduce fprobe function entry/exit probe'
Masami Hiramatsu says:
====================
Hi,
Here is the 12th version of fprobe. This version fixes a possible gcc-11 issue which
was reported as kretprobes on arm issue, and also I updated the fprobe document.
This series introduces the fprobe, the function entry/exit probe
with multiple probe point support for x86, arm64 and powerpc64le.
This also introduces the rethook for hooking function return as same as
the kretprobe does. This abstraction will help us to generalize the fgraph
tracer, because we can just switch to it from the rethook in fprobe,
depending on the kernel configuration.
And the patch [9/10] adds the FPROBE_FL_KPROBE_SHARED flag for the case
if user wants to share the same code (or share a same resource) on the
fprobe and the kprobes.
I forcibly updated my kprobes/fprobe branch, you can pull this series
from:
Jiri Olsa [Wed, 16 Mar 2022 12:24:09 +0000 (13:24 +0100)]
bpf: Add multi kprobe link
Adding new link type BPF_LINK_TYPE_KPROBE_MULTI that attaches kprobe
program through fprobe API.
The fprobe API allows to attach probe on multiple functions at once
very fast, because it works on top of ftrace. On the other hand this
limits the probe point to the function entry or return.
The kprobe program gets the same pt_regs input ctx as when it's attached
through the perf API.
Adding new attach type BPF_TRACE_KPROBE_MULTI that allows attachment
kprobe to multiple function with new link.
User provides array of addresses or symbols with count to attach the
kprobe program to. The new link_create uapi interface looks like:
Masami Hiramatsu [Tue, 15 Mar 2022 14:02:11 +0000 (23:02 +0900)]
fprobe: Introduce FPROBE_FL_KPROBE_SHARED flag for fprobe
Introduce FPROBE_FL_KPROBE_SHARED flag for sharing fprobe callback with
kprobes safely from the viewpoint of recursion.
Since the recursion safety of the fprobe (and ftrace) is a bit different
from the kprobes, this may cause an issue if user wants to run the same
code from the fprobe and the kprobes.
The kprobes has per-cpu 'current_kprobe' variable which protects the
kprobe handler from recursion in any case. On the other hand, the fprobe
uses only ftrace_test_recursion_trylock(), which will allow interrupt
context calls another (or same) fprobe during the fprobe user handler is
running.
This is not a matter in cases if the common callback shared among the
kprobes and the fprobe has its own recursion detection, or it can handle
the recursion in the different contexts (normal/interrupt/NMI.)
But if it relies on the 'current_kprobe' recursion lock, it has to check
kprobe_running() and use kprobe_busy_*() APIs.
Fprobe has FPROBE_FL_KPROBE_SHARED flag to do this. If your common callback
code will be shared with kprobes, please set FPROBE_FL_KPROBE_SHARED
*before* registering the fprobe, like;
fprobe.flags = FPROBE_FL_KPROBE_SHARED;
register_fprobe(&fprobe, "func*", NULL);
This will protect your common callback from the nested call.
Masami Hiramatsu [Tue, 15 Mar 2022 14:02:00 +0000 (23:02 +0900)]
fprobe: Add sample program for fprobe
Add a sample program for the fprobe. The sample_fprobe puts a fprobe on
kernel_clone() by default. This dump stack and some called address info
at the function entry and exit.
The sample_fprobe.ko gets 2 parameters.
- symbol: you can specify the comma separated symbols or wildcard symbol
pattern (in this case you can not use comma)
- stackdump: a bool value to enable or disable stack dump in the fprobe
handler.
Masami Hiramatsu [Tue, 15 Mar 2022 14:01:48 +0000 (23:01 +0900)]
fprobe: Add exit_handler support
Add exit_handler to fprobe. fprobe + rethook allows us to hook the kernel
function return. The rethook will be enabled only if the
fprobe::exit_handler is set.
Masami Hiramatsu [Tue, 15 Mar 2022 14:01:36 +0000 (23:01 +0900)]
ARM: rethook: Add rethook arm implementation
Add rethook arm implementation. Most of the code has been copied from
kretprobes on arm.
Since the arm's ftrace implementation is a bit special, this needs a
special care using from fprobe.
Masami Hiramatsu [Tue, 15 Mar 2022 14:00:50 +0000 (23:00 +0900)]
rethook: Add a generic return hook
Add a return hook framework which hooks the function return. Most of the
logic came from the kretprobe, but this is independent from kretprobe.
Note that this is expected to be used with other function entry hooking
feature, like ftrace, fprobe, adn kprobes. Eventually this will replace
the kretprobe (e.g. kprobe + rethook = kretprobe), but at this moment,
this is just an additional hook.
Masami Hiramatsu [Tue, 15 Mar 2022 14:00:38 +0000 (23:00 +0900)]
fprobe: Add ftrace based probe APIs
The fprobe is a wrapper API for ftrace function tracer.
Unlike kprobes, this probes only supports the function entry, but this
can probe multiple functions by one fprobe. The usage is similar, user
will set their callback to fprobe::entry_handler and call
register_fprobe*() with probed functions.
There are 3 registration interfaces,
- register_fprobe() takes filtering patterns of the functin names.
- register_fprobe_ips() takes an array of ftrace-location addresses.
- register_fprobe_syms() takes an array of function names.
The registered fprobes can be unregistered with unregister_fprobe().
e.g.
Jiri Olsa [Tue, 15 Mar 2022 14:00:26 +0000 (23:00 +0900)]
ftrace: Add ftrace_set_filter_ips function
Adding ftrace_set_filter_ips function to be able to set filter on
multiple ip addresses at once.
With the kprobe multi attach interface we have cases where we need to
initialize ftrace_ops object with thousands of functions, so having
single function diving into ftrace_hash_move_and_update_ops with
ftrace_lock is faster.
The functions ips are passed as unsigned long array with count.
Kaixi Fan [Sun, 13 Mar 2022 16:41:16 +0000 (00:41 +0800)]
selftests/bpf: Fix tunnel remote IP comments
In namespace at_ns0, the IP address of tnl dev is 10.1.1.100 which is the
overlay IP, and the ip address of veth0 is 172.16.1.100 which is the vtep
IP. When doing 'ping 10.1.1.100' from root namespace, the remote_ip should
be 172.16.1.100.
Fixes: 3de8ce91aa0e ("selftests/bpf: bpf tunnel test.") Signed-off-by: Kaixi Fan <fankaixi.li@bytedance.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220313164116.5889-1-fankaixi.li@bytedance.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenzo Bianconi [Fri, 11 Mar 2022 09:14:19 +0000 (10:14 +0100)]
veth: Rework veth_xdp_rcv_skb in order to accept non-linear skb
Introduce veth_convert_skb_to_xdp_buff routine in order to
convert a non-linear skb into a xdp buffer. If the received skb
is cloned or shared, veth_convert_skb_to_xdp_buff will copy it
in a new skb composed by order-0 pages for the linear and the
fragmented area. Moreover veth_convert_skb_to_xdp_buff guarantees
we have enough headroom for xdp.
This is a preliminary patch to allow attaching xdp programs with frags
support on veth devices.
Lorenzo Bianconi [Fri, 11 Mar 2022 09:14:18 +0000 (10:14 +0100)]
net: veth: Account total xdp_frame len running ndo_xdp_xmit
Even if this is a theoretical issue since it is not possible to perform
XDP_REDIRECT on a non-linear xdp_frame, veth driver does not account
paged area in ndo_xdp_xmit function pointer.
Introduce xdp_get_frame_len utility routine to get the xdp_frame full
length and account total frame size running XDP_REDIRECT of a
non-linear xdp frame into a veth device.
Hou Tao [Wed, 9 Mar 2022 12:33:21 +0000 (20:33 +0800)]
selftests/bpf: Test subprog jit when toggle bpf_jit_harden repeatedly
When bpf_jit_harden is toggled between 0 and 2, subprog jit may fail
due to inconsistent twice read values of bpf_jit_harden during jit. So
add a test to ensure the problem is fixed.
Hou Tao [Wed, 9 Mar 2022 12:33:20 +0000 (20:33 +0800)]
bpf: Fix net.core.bpf_jit_harden race
It is the bpf_jit_harden counterpart to commit 9279b11084f4 ("bpf: fix
net.core.bpf_jit_enable race"). bpf_jit_harden will be tested twice
for each subprog if there are subprogs in bpf program and constant
blinding may increase the length of program, so when running
"./test_progs -t subprogs" and toggling bpf_jit_harden between 0 and 2,
jit_subprogs may fail because constant blinding increases the length
of subprog instructions during extra passs.
So cache the value of bpf_jit_blinding_enabled() during program
allocation, and use the cached value during constant blinding, subprog
JITing and args tracking of tail call.
Hou Tao [Wed, 9 Mar 2022 12:33:18 +0000 (20:33 +0800)]
bpf, x86: Fall back to interpreter mode when extra pass fails
Extra pass for subprog jit may fail (e.g. due to bpf_jit_harden race),
but bpf_func is not cleared for the subprog and jit_subprogs will
succeed. The running of the bpf program may lead to oops because the
memory for the jited subprog image has already been freed.
So fall back to interpreter mode by clearing bpf_func/jited/jited_len
when extra pass fails.
Merge branch 'Remove libcap dependency from bpf selftests'
Martin KaFai Lau says:
====================
After upgrading to the newer libcap (>= 2.60),
the libcap commit aca076443591 ("Make cap_t operations thread safe.")
added a "__u8 mutex;" to the "struct _cap_struct". It caused a few byte
shift that breaks the assumption made in the "struct libcap" definition
in test_verifier.c.
This set is to remove the libcap dependency from the bpf selftests.
v2:
- Define CAP_PERFMON and CAP_BPF when the older <linux/capability.h>
does not have them. (Andrii)
====================
Martin KaFai Lau [Wed, 16 Mar 2022 17:38:35 +0000 (10:38 -0700)]
bpf: selftests: Remove libcap usage from test_progs
This patch removes the libcap usage from test_progs.
bind_perm.c is the only user. cap_*_effective() helpers added in the
earlier patch are directly used instead.
No other selftest binary is using libcap, so '-lcap' is also removed
from the Makefile.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: Stanislav Fomichev <sdf@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220316173835.2039334-1-kafai@fb.com
Martin KaFai Lau [Wed, 16 Mar 2022 17:38:23 +0000 (10:38 -0700)]
bpf: selftests: Add helpers to directly use the capget and capset syscall
After upgrading to the newer libcap (>= 2.60),
the libcap commit aca076443591 ("Make cap_t operations thread safe.")
added a "__u8 mutex;" to the "struct _cap_struct". It caused a few byte
shift that breaks the assumption made in the "struct libcap" definition
in test_verifier.c.
The bpf selftest usage only needs to enable and disable the effective
caps of the running task. It is easier to directly syscall the
capget and capset instead. It can also remove the libcap
library dependency.
The cap_helpers.{c,h} is added. One __u64 is used for all CAP_*
bits instead of two __u32.
Dmitrii Dolgov [Wed, 9 Mar 2022 16:31:12 +0000 (17:31 +0100)]
bpftool: Add bpf_cookie to link output
Commit 09ab73af3ba2 ("bpf: Allow to specify user-provided bpf_cookie for
BPF perf links") introduced the concept of user specified bpf_cookie,
which could be accessed by BPF programs using bpf_get_attach_cookie().
For troubleshooting purposes it is convenient to expose bpf_cookie via
bpftool as well, so there is no need to meddle with the target BPF
program itself.
Implemented using the pid iterator BPF program to actually fetch
bpf_cookies, which allows constraining code changes only to bpftool.
$ bpftool link
1: type 7 prog 5
bpf_cookie 123
pids bootstrap(81)
Guo Zhengkui [Tue, 15 Mar 2022 13:01:26 +0000 (21:01 +0800)]
selftests/bpf: Clean up array_size.cocci warnings
Clean up the array_size.cocci warnings under tools/testing/selftests/bpf/:
Use `ARRAY_SIZE(arr)` instead of forms like `sizeof(arr)/sizeof(arr[0])`.
tools/testing/selftests/bpf/test_cgroup_storage.c uses ARRAY_SIZE() defined
in tools/include/linux/kernel.h (sys/sysinfo.h -> linux/kernel.h), while
others use ARRAY_SIZE() in bpf_util.h.
Niklas Söderlund [Tue, 15 Mar 2022 10:29:48 +0000 (11:29 +0100)]
samples/bpf, xdpsock: Fix race when running for fix duration of time
When running xdpsock for a fix duration of time before terminating
using --duration=<n>, there is a race condition that may cause xdpsock
to terminate immediately.
When running for a fixed duration of time the check to determine when to
terminate execution is in is_benchmark_done() and is being executed in
the context of the poller thread,
if (opt_duration > 0) {
unsigned long dt = (get_nsecs() - start_time);
if (dt >= opt_duration)
benchmark_done = true;
}
However start_time is only set after the poller thread have been
created. This leaves a small window when the poller thread is starting
and calls is_benchmark_done() for the first time that start_time is not
yet set. In that case start_time have its initial value of 0 and the
duration check fails as it do not correlate correctly for the
applications start time and immediately sets benchmark_done which in
turn terminates the xdpsock application.
Fix this by setting start_time before creating the poller thread.
Fixes: e7f6e63be8a9 ("samples/bpf: xdpsock: Add duration option to specify how long to run") Signed-off-by: Niklas Söderlund <niklas.soderlund@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220315102948.466436-1-niklas.soderlund@corigine.com
The mem of msg has been uncharged in tcp_bpf_send_verdict() by
sk_msg_return(), and would be uncharged by sk_msg_free() again. When psock
is null, we can simply returning an error code, this would then trigger
the sk_msg_free_nocharge in the error path of __SK_REDIRECT and would have
the side effect of throwing an error up to user space. This would be a
slight change in behavior from user side but would look the same as an
error if the redirect on the socket threw an error.
This issue can cause the following info:
WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260
Call Trace:
<TASK>
__sk_destruct+0x24/0x1f0
sk_psock_destroy+0x19b/0x1c0
process_one_work+0x1b3/0x3c0
worker_thread+0x30/0x350
? process_one_work+0x3c0/0x3c0
kthread+0xe6/0x110
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x22/0x30
</TASK>
Fixes: 38506f4bbc9d ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
Wang Yufen [Fri, 4 Mar 2022 08:11:43 +0000 (16:11 +0800)]
bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full
If tcp_bpf_sendmsg() is running while sk msg is full. When sk_msg_alloc()
returns -ENOMEM error, tcp_bpf_sendmsg() goes to wait_for_memory. If partial
memory has been alloced by sk_msg_alloc(), that is, msg_tx->sg.size is
greater than osize after sk_msg_alloc(), memleak occurs. To fix we use
sk_msg_trim() to release the allocated memory, then goto wait for memory.
Other call paths of sk_msg_alloc() have the similar issue, such as
tls_sw_sendmsg(), so handle sk_msg_trim logic inside sk_msg_alloc(),
as Cong Wang suggested.
This issue can cause the following info:
WARNING: CPU: 3 PID: 7950 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0
Call Trace:
<TASK>
inet_csk_destroy_sock+0x55/0x110
__tcp_close+0x279/0x470
tcp_close+0x1f/0x60
inet_release+0x3f/0x80
__sock_release+0x3d/0xb0
sock_close+0x11/0x20
__fput+0x92/0x250
task_work_run+0x6a/0xa0
do_exit+0x33b/0xb60
do_group_exit+0x2f/0xa0
get_signal+0xb6/0x950
arch_do_signal_or_restart+0xac/0x2a0
exit_to_user_mode_prepare+0xa9/0x200
syscall_exit_to_user_mode+0x12/0x30
do_syscall_64+0x46/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
While drop_sk_msg(), the msg has charged memory form sk by sk_mem_charge
and has sg pages need to put. To fix we use sk_msg_free() and then kfee()
msg.
This issue can cause the following info:
WARNING: CPU: 0 PID: 9202 at net/core/stream.c:205 sk_stream_kill_queues+0xc8/0xe0
Call Trace:
<IRQ>
inet_csk_destroy_sock+0x55/0x110
tcp_rcv_state_process+0xe5f/0xe90
? sk_filter_trim_cap+0x10d/0x230
? tcp_v4_do_rcv+0x161/0x250
tcp_v4_do_rcv+0x161/0x250
tcp_v4_rcv+0xc3a/0xce0
ip_protocol_deliver_rcu+0x3d/0x230
ip_local_deliver_finish+0x54/0x60
ip_local_deliver+0xfd/0x110
? ip_protocol_deliver_rcu+0x230/0x230
ip_rcv+0xd6/0x100
? ip_local_deliver+0x110/0x110
__netif_receive_skb_one_core+0x85/0xa0
process_backlog+0xa4/0x160
__napi_poll+0x29/0x1b0
net_rx_action+0x287/0x300
__do_softirq+0xff/0x2fc
do_softirq+0x79/0x90
</IRQ>
Yonghong Song [Fri, 11 Mar 2022 00:37:21 +0000 (16:37 -0800)]
selftests/bpf: Fix a clang compilation error for send_signal.c
Building selftests/bpf with latest clang compiler (clang15 built
from source), I hit the following compilation error:
/.../prog_tests/send_signal.c:43:16: error: variable 'j' set but not used [-Werror,-Wunused-but-set-variable]
volatile int j = 0;
^
1 error generated.
The problem also exists with clang13 and clang14. clang12 is okay.
In send_signal.c, we have the following code ...
volatile int j = 0;
[...]
for (int i = 0; i < 100000000 && !sigusr1_received; i++)
j /= i + 1;
... to burn CPU cycles so bpf_send_signal() helper can be tested
in NMI mode.
Slightly changing 'j /= i + 1' to 'j /= i + j + 1' or 'j++' can
fix the problem. Further investigation indicated this should be
a clang bug ([1]). The upstream fix will be proposed later. But it
is a good idea to workaround the issue to unblock people who build
kernel/selftests with clang.
selftests/bpf: Add a test for maximum packet size in xdp_do_redirect
This adds an extra test to the xdp_do_redirect selftest for XDP live packet
mode, which verifies that the maximum permissible packet size is accepted
without any errors, and that a too big packet is correctly rejected.
bpf, test_run: Fix packet size check for live packet mode
The live packet mode uses some extra space at the start of each page to
cache data structures so they don't have to be rebuilt at every repetition.
This space wasn't correctly accounted for in the size checking of the
arguments supplied to userspace. In addition, the definition of the frame
size should include the size of the skb_shared_info (as there is other
logic that subtracts the size of this).
Together, these mistakes resulted in userspace being able to trip the
XDP_WARN() in xdp_update_frame_from_buff(), which syzbot discovered in
short order. Fix this by changing the frame size define and adding the
extra headroom to the bpf_prog_test_run_xdp() function. Also drop the
max_len parameter to the page_pool init, since this is related to DMA which
is not used for the page pool instance in PROG_TEST_RUN.
Fixes: c23b8c0ec3f7 ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: syzbot+0e91362d99386dc5de99@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220310225621.53374-1-toke@redhat.com
Hao Luo [Thu, 10 Mar 2022 21:16:55 +0000 (13:16 -0800)]
compiler_types: Refactor the use of btf_type_tag attribute.
Previous patches have introduced the compiler attribute btf_type_tag for
__user and __percpu. The availability of this attribute depends on
some CONFIGs and compiler support. This patch refactors the use
of btf_type_tag by introducing BTF_TYPE_TAG, which hides all the
dependencies.
Merge branch 'bpf-lsm: Extend interoperability with IMA'
Roberto Sassu says:
====================
Extend the interoperability with IMA, to give wider flexibility for the
implementation of integrity-focused LSMs based on eBPF.
Patch 1 fixes some style issues.
Patches 2-6 give the ability to eBPF-based LSMs to take advantage of the
measurement capability of IMA without needing to setup a policy in IMA
(those LSMs might implement the policy capability themselves).
Patches 7-9 allow eBPF-based LSMs to evaluate files read by the kernel.
Changelog
v2:
- Add better description to patch 1 (suggested by Shuah)
- Recalculate digest if it is not fresh (when IMA_COLLECTED flag not set)
- Move declaration of bpf_ima_file_hash() at the end (suggested by
Yonghong)
- Add tests to check if the digest has been recalculated
- Add deny test for bpf_kernel_read_file()
- Add description to tests
v1:
- Modify ima_file_hash() only and allow the usage of the function with the
modified behavior by eBPF-based LSMs through the new function
bpf_ima_file_hash() (suggested by Mimi)
- Make bpf_lsm_kernel_read_file() sleepable so that bpf_ima_inode_hash()
and bpf_ima_file_hash() can be called inside the implementation of
eBPF-based LSMs for this hook
====================
Roberto Sassu [Wed, 2 Mar 2022 11:14:03 +0000 (12:14 +0100)]
selftests/bpf: Add test for bpf_lsm_kernel_read_file()
Test the ability of bpf_lsm_kernel_read_file() to call the sleepable
functions bpf_ima_inode_hash() or bpf_ima_file_hash() to obtain a
measurement of a loaded IMA policy.
Roberto Sassu [Wed, 2 Mar 2022 11:14:02 +0000 (12:14 +0100)]
bpf-lsm: Make bpf_lsm_kernel_read_file() as sleepable
Make bpf_lsm_kernel_read_file() as sleepable, so that bpf_ima_inode_hash()
or bpf_ima_file_hash() can be called inside the implementation of this
hook.
Roberto Sassu [Wed, 2 Mar 2022 11:14:01 +0000 (12:14 +0100)]
selftests/bpf: Check if the digest is refreshed after a file write
Verify that bpf_ima_inode_hash() returns a non-fresh digest after a file
write, and that bpf_ima_file_hash() returns a fresh digest. Verification is
done by requesting the digest from the bprm_creds_for_exec hook, called
before ima_bprm_check().
Roberto Sassu [Wed, 2 Mar 2022 11:13:58 +0000 (12:13 +0100)]
bpf-lsm: Introduce new helper bpf_ima_file_hash()
ima_file_hash() has been modified to calculate the measurement of a file on
demand, if it has not been already performed by IMA or the measurement is
not fresh. For compatibility reasons, ima_inode_hash() remains unchanged.
Keep the same approach in eBPF and introduce the new helper
bpf_ima_file_hash() to take advantage of the modified behavior of
ima_file_hash().
Roberto Sassu [Wed, 2 Mar 2022 11:13:57 +0000 (12:13 +0100)]
ima: Always return a file measurement in ima_file_hash()
__ima_inode_hash() checks if a digest has been already calculated by
looking for the integrity_iint_cache structure associated to the passed
inode.
Users of ima_file_hash() (e.g. eBPF) might be interested in obtaining the
information without having to setup an IMA policy so that the digest is
always available at the time they call this function.
In addition, they likely expect the digest to be fresh, e.g. recalculated
by IMA after a file write. Although getting the digest from the
bprm_committed_creds hook (as in the eBPF test) ensures that the digest is
fresh, as the IMA hook is executed before that hook, this is not always the
case (e.g. for the mmap_file hook).
Call ima_collect_measurement() in __ima_inode_hash(), if the file
descriptor is available (passed by ima_file_hash()) and the digest is not
available/not fresh, and store the file measurement in a temporary
integrity_iint_cache structure.
This change does not cause memory usage increase, due to using the
temporary integrity_iint_cache structure, and due to freeing the
ima_digest_data structure inside integrity_iint_cache before exiting from
__ima_inode_hash().
For compatibility reasons, the behavior of ima_inode_hash() remains
unchanged.
Roberto Sassu [Wed, 2 Mar 2022 11:13:56 +0000 (12:13 +0100)]
ima: Fix documentation-related warnings in ima_main.c
Fix the following warnings in ima_main.c, displayed with W=n make argument:
security/integrity/ima/ima_main.c:432: warning: Function parameter or
member 'vma' not described in 'ima_file_mprotect'
security/integrity/ima/ima_main.c:636: warning: Function parameter or
member 'inode' not described in 'ima_post_create_tmpfile'
security/integrity/ima/ima_main.c:636: warning: Excess function parameter
'file' description in 'ima_post_create_tmpfile'
security/integrity/ima/ima_main.c:843: warning: Function parameter or
member 'load_id' not described in 'ima_post_load_data'
security/integrity/ima/ima_main.c:843: warning: Excess function parameter
'id' description in 'ima_post_load_data'
Also, fix some style issues in the description of ima_post_create_tmpfile()
and ima_post_path_mknod().
Chris J Arges [Wed, 9 Mar 2022 21:41:58 +0000 (15:41 -0600)]
bpftool: Ensure bytes_memlock json output is correct
If a BPF map is created over 2^32 the memlock value as displayed in JSON
format will be incorrect. Use atoll instead of atoi so that the correct
number is displayed.
```
$ bpftool map create /sys/fs/bpf/test_bpfmap type hash key 4 \
value 1024 entries 4194304 name test_bpfmap
$ bpftool map list
1: hash name test_bpfmap flags 0x0
key 4B value 1024B max_entries 4194304 memlock 4328521728B
$ sudo bpftool map list -j | jq .[].bytes_memlock 33554432
```