Brendan Jackman [Tue, 2 Feb 2021 13:50:02 +0000 (13:50 +0000)]
bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH
When BPF_FETCH is set, atomic instructions load a value from memory
into a register. The current verifier code first checks via
check_mem_access whether we can access the memory, and then checks
via check_reg_arg whether we can write into the register.
For loads, check_reg_arg has the side-effect of marking the
register's value as unkonwn, and check_mem_access has the side effect
of propagating bounds from memory to the register. This currently only
takes effect for stack memory.
Therefore with the current order, bounds information is thrown away,
but by simply reversing the order of check_reg_arg
vs. check_mem_access, we can instead propagate bounds smartly.
A simple test is added with an infinite loop that can only be proved
unreachable if this propagation is present. This is implemented both
with C and directly in test_verifier using assembly.
Tiezhu Yang [Tue, 26 Jan 2021 14:05:25 +0000 (22:05 +0800)]
samples/bpf: Add include dir for MIPS Loongson64 to fix build errors
There exists many build errors when make M=samples/bpf on the Loongson
platform. This issue is MIPS related, x86 compiles just fine.
Here are some errors:
CLANG-bpf samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:2:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:8:
In file included from ./include/linux/kernel.h:11:
In file included from ./include/linux/bitops.h:32:
In file included from ./arch/mips/include/asm/bitops.h:19:
In file included from ./arch/mips/include/asm/barrier.h:11:
./arch/mips/include/asm/addrspace.h:13:10: fatal error: 'spaces.h' file not found
^~~~~~~~~~
1 error generated.
CLANG-bpf samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:2:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:8:
In file included from ./include/linux/kernel.h:11:
In file included from ./include/linux/bitops.h:32:
In file included from ./arch/mips/include/asm/bitops.h:22:
In file included from ./arch/mips/include/asm/cpu-features.h:13:
In file included from ./arch/mips/include/asm/cpu-info.h:15:
In file included from ./include/linux/cache.h:6:
./arch/mips/include/asm/cache.h:12:10: fatal error: 'kmalloc.h' file not found
^~~~~~~~~~~
1 error generated.
CLANG-bpf samples/bpf/sockex2_kern.o
In file included from samples/bpf/sockex2_kern.c:2:
In file included from ./include/uapi/linux/in.h:24:
In file included from ./include/linux/socket.h:8:
In file included from ./include/linux/uio.h:8:
In file included from ./include/linux/kernel.h:11:
In file included from ./include/linux/bitops.h:32:
In file included from ./arch/mips/include/asm/bitops.h:22:
./arch/mips/include/asm/cpu-features.h:15:10: fatal error: 'cpu-feature-overrides.h' file not found
^~~~~~~~~~~~~~~~~~~~~~~~~
1 error generated.
In the arch/mips/Makefile, there exists the following board-dependent
options:
include arch/mips/Kbuild.platforms
cflags-y += -I$(srctree)/arch/mips/include/asm/mach-generic
So we can do the similar things in samples/bpf/Makefile, just add
platform specific and generic include dir for MIPS Loongson64 to
fix the build errors.
Tobias Klauser [Wed, 27 Jan 2021 17:46:15 +0000 (18:46 +0100)]
bpf: Simplify cases in bpf_base_func_proto
!perfmon_capable() is checked before the last switch(func_id) in
bpf_base_func_proto. Thus, the cases BPF_FUNC_trace_printk and
BPF_FUNC_snprintf_btf can be moved to that last switch(func_id) to omit
the inline !perfmon_capable() checks.
bpf: Enable bpf_{g,s}etsockopt in BPF_CGROUP_UDP{4,6}_RECVMSG
Those hooks run as BPF_CGROUP_RUN_SA_PROG_LOCK and operate on a locked socket.
Note that we could remove the switch for prog->expected_attach_type altogether
since all current sock_addr attach types are covered. However, it makes sense
to keep it as a safe-guard in case new sock_addr attach types are added that
might not operate on a locked socket. Therefore, avoid to let this slip through.
Sedat Dilek [Thu, 28 Jan 2021 01:50:58 +0000 (02:50 +0100)]
tools: Factor Clang, LLC and LLVM utils definitions
When dealing with BPF/BTF/pahole and DWARF v5 I wanted to build bpftool.
While looking into the source code I found duplicate assignments in misc tools
for the LLVM eco system, e.g. clang and llvm-objcopy.
Move the Clang, LLC and/or LLVM utils definitions to tools/scripts/Makefile.include
file and add missing includes where needed. Honestly, I was inspired by the commit df0b526fe59f ("tools: Factor HOSTCC, HOSTLD, HOSTAR definitions").
I tested with bpftool and perf on Debian/testing AMD64 and LLVM/Clang v11.1.0-rc1.
Build instructions:
[ make and make-options ]
MAKE="make V=1"
MAKE_OPTS="HOSTCC=clang HOSTCXX=clang++ HOSTLD=ld.lld CC=clang LD=ld.lld LLVM=1 LLVM_IAS=1"
MAKE_OPTS="$MAKE_OPTS PAHOLE=/opt/pahole/bin/pahole"
bpf: Allow rewriting to ports under ip_unprivileged_port_start
At the moment, BPF_CGROUP_INET{4,6}_BIND hooks can rewrite user_port
to the privileged ones (< ip_unprivileged_port_start), but it will
be rejected later on in the __inet_bind or __inet6_bind.
Let's add another return value to indicate that CAP_NET_BIND_SERVICE
check should be ignored. Use the same idea as we currently use
in cgroup/egress where bit #1 indicates CN. Instead, for
cgroup/bind{4,6}, bit #1 indicates that CAP_NET_BIND_SERVICE should
be bypassed.
v5:
- rename flags to be less confusing (Andrey Ignatov)
- rework BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY to work on flags
and accept BPF_RET_SET_CN (no behavioral changes)
Andrii Nakryiko [Tue, 26 Jan 2021 06:50:18 +0000 (22:50 -0800)]
selftests/bpf: Don't exit on failed bpf_testmod unload
Fix bug in handling bpf_testmod unloading that will cause test_progs exiting
prematurely if bpf_testmod unloading failed. This is especially problematic
when running a subset of test_progs that doesn't require root permissions and
doesn't rely on bpf_testmod, yet will fail immediately due to exit(1) in
unload_bpf_testmod().
Tiezhu Yang [Mon, 25 Jan 2021 05:05:46 +0000 (13:05 +0800)]
samples/bpf: Set flag __SANE_USERSPACE_TYPES__ for MIPS to fix build warnings
There exists many build warnings when make M=samples/bpf on the Loongson
platform, this issue is MIPS related, x86 compiles just fine.
Here are some warnings:
CC samples/bpf/ibumad_user.o
samples/bpf/ibumad_user.c: In function ‘dump_counts’:
samples/bpf/ibumad_user.c:46:24: warning: format ‘%llu’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=]
printf("0x%02x : %llu\n", key, value);
~~~^ ~~~~~
%lu
CC samples/bpf/offwaketime_user.o
samples/bpf/offwaketime_user.c: In function ‘print_ksym’:
samples/bpf/offwaketime_user.c:34:17: warning: format ‘%llx’ expects argument of type ‘long long unsigned int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=]
printf("%s/%llx;", sym->name, addr);
~~~^ ~~~~
%lx
samples/bpf/offwaketime_user.c: In function ‘print_stack’:
samples/bpf/offwaketime_user.c:68:17: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘__u64’ {aka ‘long unsigned int’} [-Wformat=]
printf(";%s %lld\n", key->waker, count);
~~~^ ~~~~~
%ld
MIPS needs __SANE_USERSPACE_TYPES__ before <linux/types.h> to select
'int-ll64.h' in arch/mips/include/uapi/asm/types.h, then it can avoid
build warnings when printing __u64 with %llu, %llx or %lld.
The header tools/include/linux/types.h defines __SANE_USERSPACE_TYPES__,
it seems that we can include <linux/types.h> in the source files which
have build warnings, but it has no effect due to actually it includes
usr/include/linux/types.h instead of tools/include/linux/types.h, the
problem is that "usr/include" is preferred first than "tools/include"
in samples/bpf/Makefile, that sounds like a ugly hack to -Itools/include
before -Iusr/include.
So define __SANE_USERSPACE_TYPES__ for MIPS in samples/bpf/Makefile
is proper, if add "TPROGS_CFLAGS += -D__SANE_USERSPACE_TYPES__" in
samples/bpf/Makefile, it appears the following error:
Auto-detecting system features:
... libelf: [ on ]
... zlib: [ on ]
... bpf: [ OFF ]
BPF API too old
make[3]: *** [Makefile:293: bpfdep] Error 1
make[2]: *** [Makefile:156: all] Error 2
With #ifndef __SANE_USERSPACE_TYPES__ in tools/include/linux/types.h,
the above error has gone and this ifndef change does not hurt other
compilations.
Björn Töpel [Fri, 22 Jan 2021 10:53:51 +0000 (11:53 +0100)]
libbpf, xsk: Select AF_XDP BPF program based on kernel version
Add detection for kernel version, and adapt the BPF program based on
kernel support. This way, users will get the best possible performance
from the BPF program.
Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Marek Majtyka <alardam@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://lore.kernel.org/bpf/20210122105351.11751-4-bjorn.topel@gmail.com
Björn Töpel [Fri, 22 Jan 2021 10:53:49 +0000 (11:53 +0100)]
xsk: Remove explicit_free parameter from __xsk_rcv()
The explicit_free parameter of the __xsk_rcv() function was used to
mark whether the call was via the generic XDP or the native XDP
path. Instead of clutter the code with if-statements and "true/false"
parameters which are hard to understand, simply move the explicit free
to the __xsk_map_redirect() which is always called from the native XDP
path.
Hangbin Liu [Fri, 22 Jan 2021 02:50:07 +0000 (10:50 +0800)]
samples/bpf: Add xdp program on egress for xdp_redirect_map
This patch add a xdp program on egress to show that we can modify
the packet on egress. In this sample we will set the pkt's src
mac to egress's mac address. The xdp_prog will be attached when
-X option supplied.
Tiezhu Yang [Fri, 22 Jan 2021 01:39:44 +0000 (09:39 +0800)]
bpf, docs: Update build procedure for manually compiling LLVM and Clang
The current LLVM and Clang build procedure in samples/bpf/README.rst is
out of date. See below that the links are not accessible any more.
$ git clone http://llvm.org/git/llvm.git
Cloning into 'llvm'...
fatal: unable to access 'http://llvm.org/git/llvm.git/': Maximum (20) redirects followed
$ git clone --depth 1 http://llvm.org/git/clang.git
Cloning into 'clang'...
fatal: unable to access 'http://llvm.org/git/clang.git/': Maximum (20) redirects followed
The LLVM community has adopted new ways to build the compiler. There are
different ways to build LLVM and Clang, the Clang Getting Started page [1]
has one way. As Yonghong said, it is better to copy the build procedure
in Documentation/bpf/bpf_devel_QA.rst to keep consistent.
I verified the procedure and it is proved to be feasible, so we should
update README.rst to reflect the reality. At the same time, update the
related comment in Makefile.
Additionally, as Fangrui said, the dir llvm-project/llvm/build/install is
not used, BUILD_SHARED_LIBS=OFF is the default option [2], so also change
Documentation/bpf/bpf_devel_QA.rst together.
At last, we recommend that developers who want the fastest incremental
builds use the Ninja build system [1], you can find it in your system's
package manager, usually the package is ninja or ninja-build [3], so add
ninja to build dependencies suggested by Nathan.
Merge branch 'bpf: misc performance improvements for cgroup'
Stanislav Fomichev says:
====================
First patch adds custom getsockopt for TCP_ZEROCOPY_RECEIVE
to remove kmalloc and lock_sock overhead from the dat path.
Second patch removes kzalloc/kfree from getsockopt for the common cases.
Third patch switches cgroup_bpf_enabled to be per-attach to
to add only overhead for the cgroup attach types used on the system.
No visible user-side changes.
v9:
- include linux/tcp.h instead of netinet/tcp.h in sockopt_sk.c
- note that v9 depends on the commit bf5b9b61d7ff ("bpf: Don't leak
memory in bpf getsockopt when optlen == 0") from bpf tree
v8:
- add bpi.h to tools/include/uapi in the same patch (Martin KaFai Lau)
- kmalloc instead of kzalloc when exporting buffer (Martin KaFai Lau)
- note that v8 depends on the commit bf5b9b61d7ff ("bpf: Don't leak
memory in bpf getsockopt when optlen == 0") from bpf tree
v7:
- add comment about buffer contents for retval != 0 (Martin KaFai Lau)
- export tcp.h into tools/include/uapi (Martin KaFai Lau)
- note that v7 depends on the commit bf5b9b61d7ff ("bpf: Don't leak
memory in bpf getsockopt when optlen == 0") from bpf tree
v6:
- avoid indirect cost for new bpf_bypass_getsockopt (Eric Dumazet)
v5:
- reorder patches to reduce the churn (Martin KaFai Lau)
v3:
- remove extra newline, add comment about sizeof tcp_zerocopy_receive
(Martin KaFai Lau)
- add another patch to remove lock_sock overhead from
TCP_ZEROCOPY_RECEIVE; technically, this makes patch #1 obsolete,
but I'd still prefer to keep it to help with other socket
options
When we attach any cgroup hook, the rest (even if unused/unattached) start
to contribute small overhead. In particular, the one we want to avoid is
__cgroup_bpf_run_filter_skb which does two redirections to get to
the cgroup and pushes/pulls skb.
Let's split cgroup_bpf_enabled to be per-attach to make sure
only used attach types trigger.
I've dropped some existing high-level cgroup_bpf_enabled in some
places because BPF_PROG_CGROUP_XXX_RUN macros usually have another
cgroup_bpf_enabled check.
I also had to copy-paste BPF_CGROUP_RUN_SA_PROG_LOCK for
GETPEERNAME/GETSOCKNAME because type for cgroup_bpf_enabled[type]
has to be constant and known at compile time.
bpf: Try to avoid kzalloc in cgroup/{s,g}etsockopt
When we attach a bpf program to cgroup/getsockopt any other getsockopt()
syscall starts incurring kzalloc/kfree cost.
Let add a small buffer on the stack and use it for small (majority)
{s,g}etsockopt values. The buffer is small enough to fit into
the cache line and cover the majority of simple options (most
of them are 4 byte ints).
It seems natural to do the same for setsockopt, but it's a bit more
involved when the BPF program modifies the data (where we have to
kmalloc). The assumption is that for the majority of setsockopt
calls (which are doing pure BPF options or apply policy) this
will bring some benefit as well.
Without this patch (we remove about 1% __kmalloc):
3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
|
--3.30%--__cgroup_bpf_run_filter_getsockopt
|
--0.81%--__kmalloc
bpf: Remove extra lock_sock for TCP_ZEROCOPY_RECEIVE
Add custom implementation of getsockopt hook for TCP_ZEROCOPY_RECEIVE.
We skip generic hooks for TCP_ZEROCOPY_RECEIVE and have a custom
call in do_tcp_getsockopt using the on-stack data. This removes
3% overhead for locking/unlocking the socket.
Without this patch:
3.38% 0.07% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt
|
--3.30%--__cgroup_bpf_run_filter_getsockopt
|
--0.81%--__kmalloc
With the patch applied:
0.52% 0.12% tcp_mmap [kernel.kallsyms] [k] __cgroup_bpf_run_filter_getsockopt_kern
Note, exporting uapi/tcp.h requires removing netinet/tcp.h
from test_progs.h because those headers have confliciting
definitions.
Yonghong Song [Tue, 19 Jan 2021 15:35:18 +0000 (07:35 -0800)]
bpf: Permit size-0 datasec
llvm patch https://reviews.llvm.org/D84002 permitted
to emit empty rodata datasec if the elf .rodata section
contains read-only data from local variables. These
local variables will be not emitted as BTF_KIND_VARs
since llvm converted these local variables as
static variables with private linkage without debuginfo
types. Such an empty rodata datasec will make
skeleton code generation easy since for skeleton
a rodata struct will be generated if there is a
.rodata elf section. The existence of a rodata
btf datasec is also consistent with the existence
of a rodata map created by libbpf.
The btf with such an empty rodata datasec will fail
in the kernel though as kernel will reject a datasec
with zero vlen and zero size. For example, for the below code,
int sys_enter(void *ctx)
{
int fmt[6] = {1, 2, 3, 4, 5, 6};
int dst[6];
bpf_probe_read(dst, sizeof(dst), fmt);
return 0;
}
We got the below btf (bpftool btf dump ./test.o):
[1] PTR '(anon)' type_id=0
[2] FUNC_PROTO '(anon)' ret_type_id=3 vlen=1
'ctx' type_id=1
[3] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[4] FUNC 'sys_enter' type_id=2 linkage=global
[5] INT 'char' size=1 bits_offset=0 nr_bits=8 encoding=SIGNED
[6] ARRAY '(anon)' type_id=5 index_type_id=7 nr_elems=4
[7] INT '__ARRAY_SIZE_TYPE__' size=4 bits_offset=0 nr_bits=32 encoding=(none)
[8] VAR '_license' type_id=6, linkage=global-alloc
[9] DATASEC '.rodata' size=0 vlen=0
[10] DATASEC 'license' size=0 vlen=1
type_id=8 offset=0 size=4
When loading the ./test.o to the kernel with bpftool,
we see the following error:
libbpf: Error loading BTF: Invalid argument(22)
libbpf: magic: 0xeb9f
...
[6] ARRAY (anon) type_id=5 index_type_id=7 nr_elems=4
[7] INT __ARRAY_SIZE_TYPE__ size=4 bits_offset=0 nr_bits=32 encoding=(none)
[8] VAR _license type_id=6 linkage=1
[9] DATASEC .rodata size=24 vlen=0 vlen == 0
libbpf: Error loading .BTF into kernel: -22. BTF is optional, ignoring.
Basically, libbpf changed .rodata datasec size to 24 since elf .rodata
section size is 24. The kernel then rejected the BTF since vlen = 0.
Note that the above kernel verifier failure can be worked around with
changing local variable "fmt" to a static or global, optionally const, variable.
This patch permits a datasec with vlen = 0 in kernel.
Merge branch 'bpf,x64: implement jump padding in jit'
Gary Lin says:
====================
This patch series implements jump padding to x64 jit to cover some
corner cases that used to consume more than 20 jit passes and caused
failure.
v4:
- Add the detailed comments about the possible padding bytes
- Add the second test case which triggers jmp_cond padding and imm32 nop
jmp padding.
- Add the new test case as another subprog
v3:
- Copy the instructions of prologue separately or the size calculation
of the first BPF instruction would include the prologue.
- Replace WARN_ONCE() with pr_err() and EFAULT
- Use MAX_PASSES in the for loop condition check
- Remove the "padded" flag from x64_jit_data. For the extra pass of
subprogs, padding is always enabled since it won't hurt the images
that converge without padding.
v2:
- Simplify the sample code in the commit description and provide the
jit code
- Check the expected padding bytes with WARN_ONCE
- Move the 'padded' flag to 'struct x64_jit_data'
- Remove the EXPECTED_FAIL flag from bpf_fill_maxinsns11() in test_bpf
- Add 2 verifier tests
====================
Qais Yousef [Tue, 19 Jan 2021 12:22:36 +0000 (12:22 +0000)]
trace: bpf: Allow bpf to attach to bare tracepoints
Some subsystems only have bare tracepoints (a tracepoint with no
associated trace event) to avoid the problem of trace events being an
ABI that can't be changed.
>From bpf presepective, bare tracepoints are what it calls
RAW_TRACEPOINT().
Since bpf assumed there's 1:1 mapping, it relied on hooking to
DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
no knowledge about their existence.
By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.
Enabling that comes with the contract that changes to raw tracepoints
don't constitute a regression if they break existing bpf programs.
We need the ability to continue to morph and modify these raw
tracepoints without worrying about any ABI.
Update Documentation/bpf/bpf_design_QA.rst to document this contract.
We first store a random number to r0 and add the corresponding
conditional jumps (2~129) to make verifier believe that those jump
instructions from 130 to 257 are reachable. When the program is sent to
x64 jit, it starts to optimize out the NOP jumps backwards from 257.
Since there are 128 such jumps, the program easily reaches 15 passes and
triggers jump padding.
As expected, a 3 bytes NOPs is inserted at 608 due to the transition
from imm32 jmp to imm8 jmp. A 2 bytes NOPs is also inserted at 687 to
replace a NOP jump.
The second test case is tricky. Here is the assembly code:
There are 4 major parts of the program.
1) 1~2049: Those are instructions to make 2050~4098 reachable. Some of
them also could generate the padding for jmp_cond.
2) 2050: This is the target instruction for the imm32 nop jmp padding.
3) 2051~4098: The repeated "goto 1~16" instructions are designed to be
consumed by the nop jmp optimization. In the end, those
instrucitons become 128 continuous 0 offset jmp and are
optimized out in 1 pass, and this make insn 2050 an imm32
nop jmp in the next pass, so that we can trigger the
5 bytes padding.
4) 4099~4100: Those are the instructions to end the program.
Since insn 2051~4098 are optimized out right before the padding pass,
there are several conditional jumps from the first part are replaced with
imm8 jmp_cond, and this triggers the 4 bytes padding, for example at
6673, 6680, and 668d. On the other hand, Insn 2050 is replaced with the
5 bytes nops at 6693.
The third test is to invoke the first and second tests as subprogs to test
bpf2bpf. Per the system log, there was one more jit happened with only
one pass and the same jit code was produced.
v4:
- Add the second test case which triggers jmp_cond padding and imm32 nop
jmp padding.
- Add the new test case as another subprog
Gary Lin [Tue, 19 Jan 2021 10:24:59 +0000 (18:24 +0800)]
bpf,x64: Pad NOPs to make images converge more easily
The x64 bpf jit expects bpf images converge within the given passes, but
it could fail to do so with some corner cases. For example:
l0: ja 40
l1: ja 40
[... repeated ja 40 ]
l39: ja 40
l40: ret #0
This bpf program contains 40 "ja 40" instructions which are effectively
NOPs and designed to be replaced with valid code dynamically. Ideally,
bpf jit should optimize those "ja 40" instructions out when translating
the bpf instructions into x64 machine code. However, do_jit() can only
remove one "ja 40" for offset==0 on each pass, so it requires at least
40 runs to eliminate those JMPs and exceeds the current limit of
passes(20). In the end, the program got rejected when BPF_JIT_ALWAYS_ON
is set even though it's legit as a classic socket filter.
To make bpf images more likely converge within 20 passes, this commit
pads some instructions with NOPs in the last 5 passes:
1. conditional jumps
A possible size variance comes from the adoption of imm8 JMP. If the
offset is imm8, we calculate the size difference of this BPF instruction
between the previous and the current pass and fill the gap with NOPs.
To avoid the recalculation of jump offset, those NOPs are inserted before
the JMP code, so we have to subtract the 2 bytes of imm8 JMP when
calculating the NOP number.
2. BPF_JA
There are two conditions for BPF_JA.
a.) nop jumps
If this instruction is not optimized out in the previous pass,
instead of removing it, we insert the equivalent size of NOPs.
b.) label jumps
Similar to condition jumps, we prepend NOPs right before the JMP
code.
To make the code concise, emit_nops() is modified to use the signed len and
return the number of inserted NOPs.
For bpf-to-bpf, we always enable padding for the extra pass since there
is only one extra run and the jump padding doesn't affected the images
that converge without padding.
After applying this patch, the corner case was loaded with the following
jit code:
At the 16th pass, 15 jumps were already optimized out, and one jump was
replaced with NOPs at 44 and the image converged at the 17th pass.
v4:
- Add the detailed comments about the possible padding bytes
v3:
- Copy the instructions of prologue separately or the size calculation
of the first BPF instruction would include the prologue.
- Replace WARN_ONCE() with pr_err() and EFAULT
- Use MAX_PASSES in the for loop condition check
- Remove the "padded" flag from x64_jit_data. For the extra pass of
subprogs, padding is always enabled since it won't hurt the images
that converge without padding.
v2:
- Simplify the sample code in the description and provide the jit code
- Check the expected padding bytes with WARN_ONCE
- Move the 'padded' flag to 'struct x64_jit_data'
Björn Töpel [Mon, 18 Jan 2021 09:17:53 +0000 (10:17 +0100)]
samples/bpf: Add BPF_ATOMIC_OP macro for BPF samples
Brendan Jackman added extend atomic operations to the BPF instruction
set in commit dbca2921d35f ("Merge branch 'Atomics for eBPF'"), which
introduces the BPF_ATOMIC_OP macro. However, that macro was missing
for the BPF samples. Fix that by adding it into bpf_insn.h.
Fixes: 0d9d2636cd2a ("bpf: Rename BPF_XADD and prepare to encode other atomics in .imm") Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: Brendan Jackman <jackmanb@google.com> Link: https://lore.kernel.org/bpf/20210118091753.107572-1-bjorn.topel@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Introduce xdp_build_skb_from_frame utility routine to build the skb
from xdp_frame. Respect to __xdp_build_skb_from_frame,
xdp_build_skb_from_frame will allocate the skb object. Rely on
xdp_build_skb_from_frame in veth driver.
Introduce missing xdp metadata support in veth_xdp_rcv_one routine.
Add missing metadata support in veth_xdp_rcv_one().
drivers/net/can/dev.c
commit e48a7c745f53 ("can: dev: can_restart: fix use after free bug")
commit 6ac4eb9449c8 ("can: dev: move driver related infrastructure into separate subdir")
Code move.
drivers/net/dsa/b53/b53_common.c
commit 8d28f2bb5dc5 ("net: dsa: b53: fix an off by one in checking "vlan->vid"")
commit 8f7c1150c015 ("net: switchdev: remove vid_begin -> vid_end range from VLAN objects")
Linus Torvalds [Wed, 20 Jan 2021 19:52:21 +0000 (11:52 -0800)]
Merge tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc5, including fixes from bpf, wireless, and
can trees.
Current release - regressions:
- nfc: nci: fix the wrong NCI_CORE_INIT parameters
Current release - new code bugs:
- bpf: allow empty module BTFs
Previous releases - regressions:
- bpf: fix signed_{sub,add32}_overflows type handling
- tcp: do not mess with cloned skbs in tcp_add_backlog()
- bpf: prevent double bpf_prog_put call from bpf_tracing_prog_attach
- bpf: don't leak memory in bpf getsockopt when optlen == 0
- tcp: fix potential use-after-free due to double kfree()
- mac80211: fix encryption issues with WEP
- devlink: use right genl user_ptr when handling port param get/set
- ipv6: set multicast flag on the multicast route
- tcp: fix TCP_USER_TIMEOUT with zero window
Previous releases - always broken:
- bpf: local storage helpers should check nullness of owner ptr passed
- mac80211: fix incorrect strlen of .write in debugfs
- cls_flower: call nla_ok() before nla_next()
- skbuff: back tiny skbs with kmalloc() in __netdev_alloc_skb() too"
* tag 'net-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (52 commits)
net: systemport: free dev before on error path
net: usb: cdc_ncm: don't spew notifications
net: mscc: ocelot: Fix multicast to the CPU port
tcp: Fix potential use-after-free due to double kfree()
bpf: Fix signed_{sub,add32}_overflows type handling
can: peak_usb: fix use after free bugs
can: vxcan: vxcan_xmit: fix use after free bug
can: dev: can_restart: fix use after free bug
tcp: fix TCP socket rehash stats mis-accounting
net: dsa: b53: fix an off by one in checking "vlan->vid"
tcp: do not mess with cloned skbs in tcp_add_backlog()
selftests: net: fib_tests: remove duplicate log test
net: nfc: nci: fix the wrong NCI_CORE_INIT parameters
sh_eth: Fix power down vs. is_opened flag ordering
net: Disable NETIF_F_HW_TLS_RX when RXCSUM is disabled
netfilter: rpfilter: mask ecn bits before fib lookup
udp: mask TOS bits in udp_v4_early_demux()
xsk: Clear pool even for inactive queues
bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
sh_eth: Make PHY access aware of Runtime PM to fix reboot crash
...
Tianjia Zhang [Tue, 19 Jan 2021 00:13:19 +0000 (00:13 +0000)]
X.509: Fix crash caused by NULL pointer
On the following call path, `sig->pkey_algo` is not assigned
in asymmetric_key_verify_signature(), which causes runtime
crash in public_key_verify_signature().
Takashi Iwai [Wed, 20 Jan 2021 16:11:12 +0000 (16:11 +0000)]
cachefiles: Drop superfluous readpages aops NULL check
After the recent actions to convert readpages aops to readahead, the
NULL checks of readpages aops in cachefiles_read_or_alloc_page() may
hit falsely. More badly, it's an ASSERT() call, and this panics.
Drop the superfluous NULL checks for fixing this regression.
[DH: Note that cachefiles never actually used readpages, so this check was
never actually necessary]
BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=208883 BugLink: https://bugzilla.opensuse.org/show_bug.cgi?id=1175245 Fixes: 5f60092fcbc4 ("CacheFiles: A cache that backs onto a mounted filesystem") Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
All three patches are by Vincent Mailhol and fix a potential use after free bug
in the CAN device infrastructure, the vxcan driver, and the peak_usk driver. In
the TX-path the skb is used to read from after it was passed to the networking
stack with netif_rx_ni().
* tag 'linux-can-fixes-for-5.11-20210120' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can:
can: peak_usb: fix use after free bugs
can: vxcan: vxcan_xmit: fix use after free bug
can: dev: can_restart: fix use after free bug
====================
Grant Grundler [Wed, 20 Jan 2021 01:12:08 +0000 (17:12 -0800)]
net: usb: cdc_ncm: don't spew notifications
RTL8156 sends notifications about every 32ms.
Only display/log notifications when something changes.
This issue has been reported by others:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1832472
https://lkml.org/lkml/2020/8/27/1083
...
[785962.779840] usb 1-1: new high-speed USB device number 5 using xhci_hcd
[785962.929944] usb 1-1: New USB device found, idVendor=0bda, idProduct=8156, bcdDevice=30.00
[785962.929949] usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=6
[785962.929952] usb 1-1: Product: USB 10/100/1G/2.5G LAN
[785962.929954] usb 1-1: Manufacturer: Realtek
[785962.929956] usb 1-1: SerialNumber: 000000001
[785962.991755] usbcore: registered new interface driver cdc_ether
[785963.017068] cdc_ncm 1-1:2.0: MAC-Address: 00:24:27:88:08:15
[785963.017072] cdc_ncm 1-1:2.0: setting rx_max = 16384
[785963.017169] cdc_ncm 1-1:2.0: setting tx_max = 16384
[785963.017682] cdc_ncm 1-1:2.0 usb0: register 'cdc_ncm' at usb-0000:00:14.0-1, CDC NCM, 00:24:27:88:08:15
[785963.019211] usbcore: registered new interface driver cdc_ncm
[785963.023856] usbcore: registered new interface driver cdc_wdm
[785963.025461] usbcore: registered new interface driver cdc_mbim
[785963.038824] cdc_ncm 1-1:2.0 enx002427880815: renamed from usb0
[785963.089586] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
[785963.121673] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
[785963.153682] cdc_ncm 1-1:2.0 enx002427880815: network connection: disconnected
...
This is about 2KB per second and will overwrite all contents of a 1MB
dmesg buffer in under 10 minutes rendering them useless for debugging
many kernel problems.
This is also an extra 180 MB/day in /var/logs (or 1GB per week) rendering
the majority of those logs useless too.
Alban Bedel [Tue, 19 Jan 2021 14:06:38 +0000 (15:06 +0100)]
net: mscc: ocelot: Fix multicast to the CPU port
Multicast entries in the MAC table use the high bits of the MAC
address to encode the ports that should get the packets. But this port
mask does not work for the CPU port, to receive these packets on the
CPU port the MAC_CPU_COPY flag must be set.
Because of this IPv6 was effectively not working because neighbor
solicitations were never received. This was not apparent before commit 6e0d2490 (net: mscc: ocelot: support IPv4, IPv6 and plain Ethernet mdb
entries) as the IPv6 entries were broken so all incoming IPv6
multicast was then treated as unknown and flooded on all ports.
To fix this problem rework the ocelot_mact_learn() to set the
MAC_CPU_COPY flag when a multicast entry that target the CPU port is
added. For this we have to read back the ports endcoded in the pseudo
MAC address by the caller. It is not a very nice design but that avoid
changing the callers and should make backporting easier.
tcp: Fix potential use-after-free due to double kfree()
Receiving ACK with a valid SYN cookie, cookie_v4_check() allocates struct
request_sock and then can allocate inet_rsk(req)->ireq_opt. After that,
tcp_v4_syn_recv_sock() allocates struct sock and copies ireq_opt to
inet_sk(sk)->inet_opt. Normally, tcp_v4_syn_recv_sock() inserts the full
socket into ehash and sets NULL to ireq_opt. Otherwise,
tcp_v4_syn_recv_sock() has to reset inet_opt by NULL and free the full
socket.
The commit 09acdfcff9769 ("tcp: fix race condition when creating child
sockets from syncookies") added a new path, in which more than one cores
create full sockets for the same SYN cookie. Currently, the core which
loses the race frees the full socket without resetting inet_opt, resulting
in that both sock_put() and reqsk_put() call kfree() for the same memory:
Calling kmalloc() between the double kfree() can lead to use-after-free, so
this patch fixes it by setting NULL to inet_opt before sock_put().
As a side note, this kind of issue does not happen for IPv6. This is
because tcp_v6_syn_recv_sock() clones both ipv6_opt and pktopts which
correspond to ireq_opt in IPv4.
Fixes: 09acdfcff976 ("tcp: fix race condition when creating child sockets from syncookies") CC: Ricardo Dias <rdias@singlestore.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Benjamin Herrenschmidt <benh@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20210118055920.82516-1-kuniyu@amazon.co.jp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) Fix wrong bpf_map_peek_elem_proto helper callback, from Mircea Cirjaliu.
2) Fix signed_{sub,add32}_overflows type truncation, from Daniel Borkmann.
3) Fix AF_XDP to also clear pools for inactive queues, from Maxim Mikityanskiy.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
bpf: Fix signed_{sub,add32}_overflows type handling
xsk: Clear pool even for inactive queues
bpf: Fix helper bpf_map_peek_elem_proto pointing to wrong callback
====================
Daniel Borkmann [Tue, 19 Jan 2021 23:24:24 +0000 (00:24 +0100)]
bpf: Fix signed_{sub,add32}_overflows type handling
Fix incorrect signed_{sub,add32}_overflows() input types (and a related buggy
comment). It looks like this might have slipped in via copy/paste issue, also
given prior to 576785048b47 ("bpf: Verifier, do explicit ALU32 bounds tracking")
the signature of signed_sub_overflows() had s64 a and s64 b as its input args
whereas now they are truncated to s32. Thus restore proper types. Also, the case
of signed_add32_overflows() is not consistent to signed_sub32_overflows(). Both
have s32 as inputs, therefore align the former.
Fixes: 576785048b47 ("bpf: Verifier, do explicit ALU32 bounds tracking") Reported-by: De4dCr0w <sa516203@mail.ustc.edu.cn> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Alexei Starovoitov <ast@kernel.org>
Vincent Mailhol [Wed, 20 Jan 2021 11:41:37 +0000 (20:41 +0900)]
can: peak_usb: fix use after free bugs
After calling peak_usb_netif_rx_ni(skb), dereferencing skb is unsafe.
Especially, the can_frame cf which aliases skb memory is accessed
after the peak_usb_netif_rx_ni().
Vincent Mailhol [Wed, 20 Jan 2021 11:41:36 +0000 (20:41 +0900)]
can: vxcan: vxcan_xmit: fix use after free bug
After calling netif_rx_ni(skb), dereferencing skb is unsafe.
Especially, the canfd_frame cfd which aliases skb memory is accessed
after the netif_rx_ni().
Vincent Mailhol [Wed, 20 Jan 2021 11:41:35 +0000 (20:41 +0900)]
can: dev: can_restart: fix use after free bug
After calling netif_rx_ni(skb), dereferencing skb is unsafe.
Especially, the can_frame cf which aliases skb memory is accessed
after the netif_rx_ni() in:
stats->rx_bytes += cf->len;
Dan Carpenter [Tue, 19 Jan 2021 14:48:03 +0000 (17:48 +0300)]
net: dsa: b53: fix an off by one in checking "vlan->vid"
The > comparison should be >= to prevent accessing one element beyond
the end of the dev->vlans[] array in the caller function, b53_vlan_add().
The "dev->vlans" array is allocated in the b53_switch_init() function
and it has "dev->num_vlans" elements.
Fixes: c8215488b3e2 ("net: dsa: b53: Plug in VLAN support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/YAbxI97Dl/pmBy5V@mwanda Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jarod Wilson [Tue, 19 Jan 2021 01:09:27 +0000 (20:09 -0500)]
bonding: add a vlan+srcmac tx hashing option
This comes from an end-user request, where they're running multiple VMs on
hosts with bonded interfaces connected to some interest switch topologies,
where 802.3ad isn't an option. They're currently running a proprietary
solution that effectively achieves load-balancing of VMs and bandwidth
utilization improvements with a similar form of transmission algorithm.
Basically, each VM has it's own vlan, so it always sends its traffic out
the same interface, unless that interface fails. Traffic gets split
between the interfaces, maintaining a consistent path, with failover still
available if an interface goes down.
Unlike bond_eth_hash(), this hash function is using the full source MAC
address instead of just the last byte, as there are so few components to
the hash, and in the no-vlan case, we would be returning just the last
byte of the source MAC as the hash value. It's entirely possible to have
two NICs in a bond with the same last byte of their MAC, but not the same
MAC, so this adjustment should guarantee distinct hashes in all cases.
This has been rudimetarily tested to provide similar results to the
proprietary solution it is aiming to replace. A patch for iproute2 is also
posted, to properly support the new mode there as well.
Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Cc: Thomas Davis <tadavis@lbl.gov> Signed-off-by: Jarod Wilson <jarod@redhat.com> Link: https://lore.kernel.org/r/20210119010927.1191922-1-jarod@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 19 Jan 2021 16:56:56 +0000 (17:56 +0100)]
net: fix GSO for SG-enabled devices
The commit dc259dcea8f2 ("net: move the hsize check to the else
block in skb_segment") introduced a data corruption for devices
supporting scatter-gather.
The problem boils down to signed/unsigned comparison given
unexpected results: if signed 'hsize' is negative, it will be
considered greater than a positive 'len', which is unsigned.
This commit addresses resorting to the old checks order, so that
'hsize' never has a negative value when compared with 'len'.
Eric Dumazet [Tue, 19 Jan 2021 16:49:00 +0000 (08:49 -0800)]
tcp: do not mess with cloned skbs in tcp_add_backlog()
Heiner Kallweit reported that some skbs were sent with
the following invalid GSO properties :
- gso_size > 0
- gso_type == 0
This was triggerring a WARN_ON_ONCE() in rtl8169_tso_csum_v2.
Juerg Haefliger was able to reproduce a similar issue using
a lan78xx NIC and a workload mixing TCP incoming traffic
and forwarded packets.
The problem is that tcp_add_backlog() is writing
over gso_segs and gso_size even if the incoming packet will not
be coalesced to the backlog tail packet.
While skb_try_coalesce() would bail out if tail packet is cloned,
this overwriting would lead to corruptions of other packets
cooked by lan78xx, sharing a common super-packet.
The strategy used by lan78xx is to use a big skb, and split
it into all received packets using skb_clone() to avoid copies.
The drawback of this strategy is that all the small skb share a common
struct skb_shared_info.
This patch rewrites TCP gso_size/gso_segs handling to only
happen on the tail skb, since skb_try_coalesce() made sure
it was not cloned.
net: smsc911x: Make Runtime PM handling more fine-grained
Currently the smsc911x driver has mininal power management: during
driver probe, the device is powered up, and during driver remove, it is
powered down.
Improve power management by making it more fine-grained:
1. Power the device down when driver probe is finished,
2. Power the device (down) when it is opened (closed),
3. Make sure the device is powered during PHY access.
====================
net: ethernet: ti: am65-cpsw-nuss: introduce support for am64x cpsw3g
This series introduces basic support for recently introduced TI K3 AM642x SoC [1]
which contains 3 port (2 external ports) CPSW3g module. The CPSW3g integrated
in MAIN domain and can be configured in multi port or switch modes.
In this series only multi port mode is enabled. The initial version of switchdev
support was introduced by Vignesh Raghavendra [2] and work is in progress.
The overall functionality and DT bindings are similar to other K3 CPSWxg
versions, so DT binding changes are minimal and driver is mostly re-used for
TI K3 AM642x CPSW3g.
The main difference is that TI K3 AM642x SoC is not fully DMA coherent and
instead DMA coherency is supported per DMA channel.
Patches 1-2 - DT bindings update
Patches 3-4 - Update driver to support changed DMA coherency model
Patches 5-6 - adds TI K3 AM642x SoC platform data and so enable CPSW3g
net: ethernet: ti: am65-cpsw: add support for am64x cpsw3g
The TI AM64x SoCs Gigabit Ethernet Switch subsystem (CPSW3g NUSS) has three
ports (2 ext. ports) and provides Ethernet packet communication for the
device and can be configured in multi port mode or as an Ethernet switch.
This patch adds support for the corresponding CPSW3g version.
Peter Ujfalusi [Fri, 15 Jan 2021 19:28:51 +0000 (21:28 +0200)]
net: ethernet: ti: am65-cpsw-nuss: Support for transparent ASEL handling
Use the glue layer's functions to convert the dma_addr_t to and from CPPI5
address (with the ASEL bits), which should be used within the descriptors
and data buffers.
- Per channel coherency support
The DMAs use the 'ASEL' bits to select data and configuration fetch path.
The ASEL bits are placed at the unused parts of any address field used by
the DMAs (pointers to descriptors, addresses in descriptors, ring base
addresses). The ASEL is not part of the address (the DMAs can address
48bits). Individual channels can be configured to be coherent (via ACP
port) or non coherent individually by configuring the ASEL to appropriate
value.
dt-binding: net: ti: k3-am654-cpsw-nuss: update bindings for am64x cpsw3g
Update DT binding for recently introduced TI K3 AM642x SoC [1] which
contains 3 port (2 external ports) CPSW3g module. The CPSW3g integrated
in MAIN domain and can be configured in multi port or switch modes.
The overall functionality and DT bindings are similar to other K3 CPSWxg
versions, so DT binding changes are minimal:
- reword description
- add new compatible 'ti,am642-cpsw-nuss'
- allow 2 external ports child nodes
- add missed 'assigned-clock' props
[1] https://www.ti.com/lit/pdf/spruim2 Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
dt-binding: ti: am65x-cpts: add assigned-clock and power-domains props
The CPTS clock is usually a clk-mux which allows to select CPTS reference
clock by using 'assigned-clock-parents', 'assigned-clocks' DT properties.
Also depending on integration the power-domains has to be specified to
enable CPTS IP.
Hence add 'assigned-clock-parents', 'assigned-clocks' and 'power-domains'
properties to the CPTS DT bindings to avoid dtbs_check warnings:
cpts@310d0000: 'assigned-clock-parents', 'assigned-clocks' do not match any of the regexes: 'pinctrl-[0-9]+'
cpts@310d0000: 'power-domains' does not match any of the regexes: 'pinctrl-[0-9]+'
Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hangbin Liu [Tue, 19 Jan 2021 02:59:30 +0000 (10:59 +0800)]
selftests: net: fib_tests: remove duplicate log test
The previous test added an address with a specified metric and check if
correspond route was created. I somehow added two logs for the same
test. Remove the duplicated one.
Reported-by: Antoine Tenart <atenart@redhat.com> Fixes: 214cfd886071 ("selftests/net/fib_tests: update addr_metric_test for peer route testing") Signed-off-by: Hangbin Liu <liuhangbin@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210119025930.2810532-1-liuhangbin@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
sh_eth: Fix power down vs. is_opened flag ordering
sh_eth_close() does a synchronous power down of the device before
marking it closed. Revert the order, to make sure the device is never
marked opened while suspended.
While at it, use pm_runtime_put() instead of pm_runtime_put_sync(), as
there is no reason to do a synchronous power down.
Fixes: 67b31f67d2d97bcc ("sh_eth: Fix sleeping function called from invalid context") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Link: https://lore.kernel.org/r/20210118150812.796791-1-geert+renesas@glider.be Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: support SCTP CRC csum offload for tunneling packets in some drivers
This patchset introduces inline function skb_csum_is_sctp(), and uses it
to validate it's a sctp CRC csum offload packet, to make SCTP CRC csum
offload for tunneling packets supported in some HW drivers.
====================
Xin Long [Sat, 16 Jan 2021 06:13:42 +0000 (14:13 +0800)]
net: ixgbevf: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes ixgbevf support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:41 +0000 (14:13 +0800)]
net: ixgbe: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes ixgbe support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:40 +0000 (14:13 +0800)]
net: igc: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes igc support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:39 +0000 (14:13 +0800)]
net: igbvf: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP CRC
checksum offload packet, and yet it also makes igbvf support SCTP
CRC checksum offload for UDP and GRE encapped packets, just as it
does in igb driver.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:38 +0000 (14:13 +0800)]
net: igb: use skb_csum_is_sctp instead of protocol check
Using skb_csum_is_sctp is a easier way to validate it's a SCTP
CRC checksum offload packet, and there is no need to parse the
packet to check its proto field, especially when it's a UDP or
GRE encapped packet.
So this patch also makes igb support SCTP CRC checksum offload
for UDP and GRE encapped packets.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Sat, 16 Jan 2021 06:13:37 +0000 (14:13 +0800)]
net: add inline function skb_csum_is_sctp
This patch is to define a inline function skb_csum_is_sctp(), and
also replace all places where it checks if it's a SCTP CSUM skb.
This function would be used later in many networking drivers in
the following patches.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Functions that end up calling fib_table_lookup() should clear the ECN
bits from the TOS, otherwise ECT(0) and ECT(1) packets can be treated
differently.
Most functions already clear the ECN bits, but there are a few cases
where this is not done. This series only fixes the ones related to
source address validation.
====================
Guillaume Nault [Sat, 16 Jan 2021 10:44:26 +0000 (11:44 +0100)]
netfilter: rpfilter: mask ecn bits before fib lookup
RT_TOS() only masks one of the two ECN bits. Therefore rpfilter_mt()
treats Not-ECT or ECT(1) packets in a different way than those with
ECT(0) or CE.
Reproducer:
Create two netns, connected with a veth:
$ ip netns add ns0
$ ip netns add ns1
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10/32 dev veth01
$ ip -netns ns1 address add 192.0.2.11/32 dev veth10
Add a route to ns1 in ns0:
$ ip -netns ns0 route add 192.0.2.11/32 dev veth01
In ns1, only packets with TOS 4 can be routed to ns0:
$ ip -netns ns1 route add 192.0.2.10/32 tos 4 dev veth10
Ping from ns0 to ns1 works regardless of the ECN bits, as long as TOS
is 4:
$ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1)
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0)
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE
... 0% packet loss ...
Now use iptable's rpfilter module in ns1:
$ ip netns exec ns1 iptables-legacy -t raw -A PREROUTING -m rpfilter --invert -j DROP
Not-ECT and ECT(1) packets still pass:
$ ip netns exec ns0 ping -Q 4 192.0.2.11 # TOS 4, Not-ECT
... 0% packet loss ...
$ ip netns exec ns0 ping -Q 5 192.0.2.11 # TOS 4, ECT(1)
... 0% packet loss ...
But ECT(0) and ECN packets are dropped:
$ ip netns exec ns0 ping -Q 6 192.0.2.11 # TOS 4, ECT(0)
... 100% packet loss ...
$ ip netns exec ns0 ping -Q 7 192.0.2.11 # TOS 4, CE
... 100% packet loss ...
After this patch, rpfilter doesn't drop ECT(0) and CE packets anymore.
Guillaume Nault [Sat, 16 Jan 2021 10:44:22 +0000 (11:44 +0100)]
udp: mask TOS bits in udp_v4_early_demux()
udp_v4_early_demux() is the only function that calls
ip_mc_validate_source() with a TOS that hasn't been masked with
IPTOS_RT_MASK.
This results in different behaviours for incoming multicast UDPv4
packets, depending on if ip_mc_validate_source() is called from the
early-demux path (udp_v4_early_demux) or from the regular input path
(ip_route_input_noref).
ECN would normally not be used with UDP multicast packets, so the
practical consequences should be limited on that side. However,
IPTOS_RT_MASK is used to also masks the TOS' high order bits, to align
with the non-early-demux path behaviour.
Reproducer:
Setup two netns, connected with veth:
$ ip netns add ns0
$ ip netns add ns1
$ ip -netns ns0 link set dev lo up
$ ip -netns ns1 link set dev lo up
$ ip link add name veth01 netns ns0 type veth peer name veth10 netns ns1
$ ip -netns ns0 link set dev veth01 up
$ ip -netns ns1 link set dev veth10 up
$ ip -netns ns0 address add 192.0.2.10 peer 192.0.2.11/32 dev veth01
$ ip -netns ns1 address add 192.0.2.11 peer 192.0.2.10/32 dev veth10
In ns0, add route to multicast address 224.0.2.0/24 using source
address 198.51.100.10:
$ ip -netns ns0 address add 198.51.100.10/32 dev lo
$ ip -netns ns0 route add 224.0.2.0/24 dev veth01 src 198.51.100.10
In ns1, define route to 198.51.100.10, only for packets with TOS 4:
$ ip -netns ns1 route add 198.51.100.10/32 tos 4 dev veth10
Also activate rp_filter in ns1, so that incoming packets not matching
the above route get dropped:
$ ip netns exec ns1 sysctl -wq net.ipv4.conf.veth10.rp_filter=1
Now try to receive packets on 224.0.2.11:
$ ip netns exec ns1 socat UDP-RECVFROM:1111,ip-add-membership=224.0.2.11:veth10,ignoreeof -
In ns0, send packet to 224.0.2.11 with TOS 4 and ECT(0) (that is,
tos 6 for socat):
$ echo test0 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test0" message is properly received by socat in ns1, because
early-demux has no cached dst to use, so source address validation
is done by ip_route_input_mc(), which receives a TOS that has the
ECN bits masked.
Now send another packet to 224.0.2.11, still with TOS 4 and ECT(0):
$ echo test1 | ip netns exec ns0 socat - UDP-DATAGRAM:224.0.2.11:1111,bind=:1111,tos=6
The "test1" message isn't received by socat in ns1, because, now,
early-demux has a cached dst to use and calls ip_mc_validate_source()
immediately, without masking the ECN bits.
Fixes: 2fc6a5e50fd7 ("udp: perform source validation for mcast early demux") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>