As discussed at LPC 2019 ([0]), this patch brings (a quite belated) support
for declarative BTF-defined map-in-map support in libbpf. It allows to define
ARRAY_OF_MAPS and HASH_OF_MAPS BPF maps without any user-space initialization
code involved.
Additionally, it allows to initialize outer map's slots with references to
respective inner maps at load time, also completely declaratively.
Despite a weak type system of C, the way BTF-defined map-in-map definition
works, it's actually quite hard to accidentally initialize outer map with
incompatible inner maps. This being C, of course, it's still possible, but
even that would be caught at load time and error returned with helpful debug
log pointing exactly to the slot that failed to be initialized.
As an example, here's a rather advanced HASH_OF_MAPS declaration and
initialization example, filling slots #0 and #4 with two inner maps:
Here's the relevant part of libbpf debug log showing pretty clearly of what's
going on with map-in-map initialization:
libbpf: .maps relo #0: for 6 value 0 rel.r_offset 96 name 260 ('inner_map1')
libbpf: .maps relo #0: map 'outer_arr' slot [0] points to map 'inner_map1'
libbpf: .maps relo #1: for 7 value 32 rel.r_offset 112 name 249 ('inner_map2')
libbpf: .maps relo #1: map 'outer_arr' slot [2] points to map 'inner_map2'
libbpf: .maps relo #2: for 7 value 32 rel.r_offset 144 name 249 ('inner_map2')
libbpf: .maps relo #2: map 'outer_hash' slot [0] points to map 'inner_map2'
libbpf: .maps relo #3: for 6 value 0 rel.r_offset 176 name 260 ('inner_map1')
libbpf: .maps relo #3: map 'outer_hash' slot [4] points to map 'inner_map1'
libbpf: map 'inner_map1': created successfully, fd=4
libbpf: map 'inner_map2': created successfully, fd=5
libbpf: map 'outer_hash': created successfully, fd=7
libbpf: map 'outer_hash': slot [0] set to map 'inner_map2' fd=5
libbpf: map 'outer_hash': slot [4] set to map 'inner_map1' fd=4
Notice from the log above that fd=6 (not logged explicitly) is used for inner
"prototype" map, necessary for creation of outer map. It is destroyed
immediately after outer map is created.
See also included selftest with some extra comments explaining extra details
of usage. Additionally, similar initialization syntax and libbpf functionality
can be used to do initialization of BPF_PROG_ARRAY with references to BPF
sub-programs. This can be done in follow up patches, if there will be a demand
for this.
libbpf: Refactor map creation logic and fix cleanup leak
Factor out map creation and destruction logic to simplify code and especially
error handling. Also fix map FD leak in case of partially successful map
creation during bpf_object load operation.
====================
This patch series adds various observability APIs to bpf_link:
- each bpf_link now gets ID, similar to bpf_map and bpf_prog, by which
user-space can iterate over all existing bpf_links and create limited FD
from ID;
- allows to get extra object information with bpf_link general and
type-specific information;
- implements `bpf link show` command which lists all active bpf_links in the
system;
- implements `bpf link pin` allowing to pin bpf_link by ID or from other
pinned path.
v2->v3:
- improve spin locking around bpf_link ID (Alexei);
- simplify bpf_link_info handling and fix compilation error on sh arch;
v1->v2:
- simplified `bpftool link show` implementation (Quentin);
- fixed formatting of bpftool-link.rst (Quentin);
- fixed attach type printing logic (Quentin);
rfc->v1:
- dropped read-only bpf_links (Alexei);
- fixed bug in bpf_link_cleanup() not removing ID;
- fixed bpftool link pinning search logic;
- added bash-completion and man page.
====================
bpftool: Expose attach_type-to-string array to non-cgroup code
Move attach_type_strings into main.h for access in non-cgroup code.
bpf_attach_type is used for non-cgroup attach types quite widely now. So also
complete missing string translations for non-cgroup attach types.
bpf: Add support for BPF_OBJ_GET_INFO_BY_FD for bpf_link
Add ability to fetch bpf_link details through BPF_OBJ_GET_INFO_BY_FD command.
Also enhance show_fdinfo to potentially include bpf_link type-specific
information (similarly to obj_info).
Also introduce enum bpf_link_type stored in bpf_link itself and expose it in
UAPI. bpf_link_tracing also now will store and return bpf_attach_type.
bpf: Support GET_FD_BY_ID and GET_NEXT_ID for bpf_link
Add support to look up bpf_link by ID and iterate over all existing bpf_links
in the system. GET_FD_BY_ID code handles not-yet-ready bpf_link by checking
that its ID hasn't been set to non-zero value yet. Setting bpf_link's ID is
done as the very last step in finalizing bpf_link, together with installing
FD. This approach allows users of bpf_link in kernel code to not worry about
races between user-space and kernel code that hasn't finished attaching and
initializing bpf_link.
Generate ID for each bpf_link using IDR, similarly to bpf_map and bpf_prog.
bpf_link creation, initialization, attachment, and exposing to user-space
through FD and ID is a complicated multi-step process, abstract it away
through bpf_link_primer and bpf_link_prime(), bpf_link_settle(), and
bpf_link_cleanup() internal API. They guarantee that until bpf_link is
properly attached, user-space won't be able to access partially-initialized
bpf_link either from FD or ID. All this allows to simplify bpf_link attachment
and error handling code.
Make bpf_link update support more generic by making it into another
bpf_link_ops methods. This allows generic syscall handling code to be agnostic
to various conditionally compiled features (e.g., the case of
CONFIG_CGROUP_BPF). This also allows to keep link type-specific code to remain
static within respective code base. Refactor existing bpf_cgroup_link code and
take advantage of this.
selftests/bpf: Copy runqslower to OUTPUT directory
$(OUTPUT)/runqslower makefile target doesn't actually create runqslower
binary in the $(OUTPUT) directory. As lib.mk expects all
TEST_GEN_PROGS_EXTENDED (which runqslower is a part of) to be present in
the OUTPUT directory, this results in an error when running e.g. `make
install`:
rsync: link_stat "tools/testing/selftests/bpf/runqslower" failed: No
such file or directory (2)
Copy the binary into the OUTPUT directory after building it to fix the
error.
Daniel Borkmann [Tue, 28 Apr 2020 19:20:20 +0000 (21:20 +0200)]
Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from userspace in common code. This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.
As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.
Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
sysctl: remove all extern declaration from sysctl.c
Extern declarations in .c files are a bad style and can lead to
mismatches. Use existing definitions in headers where they exist,
and otherwise move the external declarations to suitable header
files.
Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
====================
We've been developing an in-house L4 load balancer based on XDP
and TC for a while. Following Alexei's call for more up-to-date examples of
production BPF in the kernel tree [1], Cloudflare is making this available
under dual GPL-2.0 or BSD 3-clause terms.
The code requires at least v5.3 to function correctly.
cls_redirect is a TC clsact based replacement for the glb-redirect iptables
module available at [1]. It enables what GitHub calls "second chance"
flows [2], similarly proposed by the Beamer paper [3]. In contrast to
glb-redirect, it also supports migrating UDP flows as long as connected
sockets are used. cls_redirect is in production at Cloudflare, as part of
our own L4 load balancer.
We have modified the encapsulation format slightly from glb-redirect:
glbgue_chained_routing.private_data_type has been repurposed to form a
version field and several flags. Both have been arranged in a way that
a private_data_type value of zero matches the current glb-redirect
behaviour. This means that cls_redirect will understand packets in
glb-redirect format, but not vice versa.
The test suite only covers basic features. For example, cls_redirect will
correctly forward path MTU discovery packets, but this is not exercised.
It is also possible to switch the encapsulation format to GRE on the last
hop, which is also not tested.
There are two major distinctions from glb-redirect: first, cls_redirect
relies on receiving encapsulated packets directly from a router. This is
because we don't have access to the neighbour tables from BPF, yet. See
forward_to_next_hop for details. Second, cls_redirect performs decapsulation
instead of using separate ipip and sit tunnel devices. This
avoids issues with the sit tunnel [4] and makes deploying the classifier
easier: decapsulated packets appear on the same interface, so existing
firewall rules continue to work as expected.
The code base started it's life on v4.19, so there are most likely still
hold overs from old workarounds. In no particular order:
- The function buf_off is required to defeat a clang optimization
that leads to the verifier rejecting the program due to pointer
arithmetic in the wrong order.
- The function pkt_parse_ipv6 is force inlined, because it would
otherwise be rejected due to returning a pointer to stack memory.
- The functions fill_tuple and classify_tcp contain kludges, because
we've run out of function arguments.
- The logic in general is rather nested, due to verifier restrictions.
I think this is either because the verifier loses track of constants
on the stack, or because it can't track enum like variables.
To make BPF verifier verbose log more releavant and easier to use to debug
verification failures, "pop" parts of log that were successfully verified.
This has effect of leaving only verifier logs that correspond to code branches
that lead to verification failure, which in practice should result in much
shorter and more relevant verifier log dumps. This behavior is made the
default behavior and can be overriden to do exhaustive logging by specifying
BPF_LOG_LEVEL2 log level.
Using BPF_LOG_LEVEL2 to disable this behavior is not ideal, because in some
cases it's good to have BPF_LOG_LEVEL2 per-instruction register dump
verbosity, but still have only relevant verifier branches logged. But for this
patch, I didn't want to add any new flags. It might be worth-while to just
rethink how BPF verifier logging is performed and requested and streamline it
a bit. But this trimming of successfully verified branches seems to be useful
and a good default behavior.
To test this, I modified runqslower slightly to introduce read of
uninitialized stack variable. Log (**truncated in the middle** to save many
lines out of this commit message) BEFORE this change:
Notice how there is a successful code path from instruction 0 through 67, few
successfully verified jumps (44->66, 34->66), and only after that 11->28 jump
plus error on instruction #32.
AFTER this change (full verifier log, **no truncation**):
Notice how in this case, there are 0-11 instructions + jump from 11 to
28 is recorded + 28-32 instructions with error on insn #32.
test_verifier test runner was updated to specify BPF_LOG_LEVEL2 for
VERBOSE_ACCEPT expected result due to potentially "incomplete" success verbose
log at BPF_LOG_LEVEL1.
On success, verbose log will only have a summary of number of processed
instructions, etc, but no branch tracing log. Having just a last succesful
branch tracing seemed weird and confusing. Having small and clean summary log
in success case seems quite logical and nice, though.
On a device like a cellphone which is constantly suspending
and resuming CLOCK_MONOTONIC is not particularly useful for
keeping track of or reacting to external network events.
Instead you want to use CLOCK_BOOTTIME.
Hence add bpf_ktime_get_boot_ns() as a mirror of bpf_ktime_get_ns()
based around CLOCK_BOOTTIME instead of CLOCK_MONOTONIC.
Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
and this was presumably marked GPL due to kernel/time/timekeeping.c:
EXPORT_SYMBOL_GPL(ktime_get_mono_fast_ns);
and while that may make sense for kernel modules (although even that
is doubtful), there is currently AFAICT no other source of time
available to ebpf.
Furthermore this is really just equivalent to clock_gettime(CLOCK_MONOTONIC)
which is exposed to userspace (via vdso even to make it performant)...
As such, I see no reason to keep the GPL restriction.
(In the future I'd like to have access to time from Apache licensed ebpf code)
Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Lorenzo Colitti [Mon, 20 Apr 2020 18:34:08 +0000 (11:34 -0700)]
net: bpf: Allow TC programs to call BPF_FUNC_skb_change_head
This allows TC eBPF programs to modify and forward (redirect) packets
from interfaces without ethernet headers (for example cellular)
to interfaces with (for example ethernet/wifi).
The lack of this appears to simply be an oversight.
Tested:
in active use in Android R on 4.14+ devices for ipv6
cellular to wifi tethering offload.
Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n
linux-next build bot reported compile issue [1] with one of its
configs. It looks like when we have CONFIG_NET=n and
CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto
definition (from net/core/filter.c) in cgroup_base_func_proto.
I'm reshuffling the code a bit to make it work. The common helpers
are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is
exported from there.
Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing
bpf_user_rnd_u32.
Luke Nelson [Tue, 21 Apr 2020 00:28:04 +0000 (17:28 -0700)]
bpf, riscv: Fix tail call count off by one in RV32 BPF JIT
This patch fixes an off by one error in the RV32 JIT handling for BPF
tail call. Currently, the code decrements TCC before checking if it
is less than zero. This limits the maximum number of tail calls to 32
instead of 33 as in other JITs. The fix is to instead check the old
value of TCC before decrementing.
bpf_helpers.h: Add note for building with vmlinux.h or linux/types.h
The following error was shown when a bpf program was compiled without
vmlinux.h auto-generated from BTF:
# clang -I./linux/tools/lib/ -I/lib/modules/$(uname -r)/build/include/ \
-O2 -Wall -target bpf -emit-llvm -c bpf_prog.c -o bpf_prog.bc
...
In file included from linux/tools/lib/bpf/bpf_helpers.h:5:
linux/tools/lib/bpf/bpf_helper_defs.h:56:82: error: unknown type name '__u64'
...
It seems that bpf programs are intended for being built together with
the vmlinux.h (which will have all the __u64 and other typedefs). But
users may mistakenly think "include <linux/types.h>" is missing
because the vmlinux.h is not common for non-bpf developers. IMO, an
explicit comment therefore should be added to bpf_helpers.h as this
patch shows.
bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}
Currently the following prog types don't fall back to bpf_base_func_proto()
(instead they have cgroup_base_func_proto which has a limited set of
helpers from bpf_base_func_proto):
* BPF_PROG_TYPE_CGROUP_DEVICE
* BPF_PROG_TYPE_CGROUP_SYSCTL
* BPF_PROG_TYPE_CGROUP_SOCKOPT
I don't see any specific reason why we shouldn't use bpf_base_func_proto(),
every other type of program (except bpf-lirc and, understandably, tracing)
use it, so let's fall back to bpf_base_func_proto for those prog types
as well.
This basically boils down to adding access to the following helpers:
* BPF_FUNC_get_prandom_u32
* BPF_FUNC_get_smp_processor_id
* BPF_FUNC_get_numa_node_id
* BPF_FUNC_tail_call
* BPF_FUNC_ktime_get_ns
* BPF_FUNC_spin_lock (CAP_SYS_ADMIN)
* BPF_FUNC_spin_unlock (CAP_SYS_ADMIN)
* BPF_FUNC_jiffies64 (CAP_SYS_ADMIN)
I've also added bpf_perf_event_output() because it's really handy for
logging and debugging.
net: openvswitch: use div_u64() for 64-by-32 divisions
Compile the kernel for arm 32 platform, the build warning found.
To fix that, should use div_u64() for divisions.
| net/openvswitch/meter.c:396: undefined reference to `__udivdi3'
[add more commit msg, change reported tag, and use div_u64 instead
of do_div by Tonghao]
Fixes: e57358873bb5d6ca ("net: openvswitch: use u64 for meter bucket") Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Tested-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: openvswitch: suitable access to the dp_meters
To fix the following sparse warning:
| net/openvswitch/meter.c:109:38: sparse: sparse: incorrect type
| in assignment (different address spaces) ...
| net/openvswitch/meter.c:720:45: sparse: sparse: incorrect type
| in argument 1 (different address spaces) ...
Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 26 Apr 2020 03:46:28 +0000 (20:46 -0700)]
Merge branch 'hinic-add-SR-IOV-support'
Luo bin says:
====================
hinic: add SR-IOV support
patch #1 adds mailbox channel support and vf can
communicate with pf or hw through it.
patch #2 adds support for enabling vf and tx/rx
capabilities based on vf.
patch #3 adds support for vf's basic configurations.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Zou Wei [Fri, 24 Apr 2020 13:53:14 +0000 (21:53 +0800)]
net/mlx4_core: Add missing iounmap() in error path
This fixes the following coccicheck warning:
drivers/net/ethernet/mellanox/mlx4/crdump.c:200:2-8: ERROR: missing iounmap;
ioremap on line 190 and execution via conditional on line 198
Fixes: 7ef19d3b1d5e ("devlink: report error once U32_MAX snapshot ids have been used") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zou Wei <zou_wei@huawei.com> Reviewed-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Fri, 24 Apr 2020 12:52:26 +0000 (20:52 +0800)]
ptp: clockmatrix: remove unnecessary comparison
The type of loaddr is u8 which is always '<=' 0xff, so the
loaddr <= 0xff is always true, we can remove this comparison.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Vincent Cheng <vincent.cheng.xh@renesas.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: mptcp: use mptcp receive buffer space to select rcv window
In MPTCP, the receive window is shared across all subflows, because it
refers to the mptcp-level sequence space.
MPTCP receivers already place incoming packets on the mptcp socket
receive queue and will charge it to the mptcp socket rcvbuf until
userspace consumes the data.
Update __tcp_select_window to use the occupancy of the parent/mptcp
socket instead of the subflow socket in case the tcp socket is part
of a logical mptcp connection.
This commit doesn't change choice of initial window for passive or active
connections.
While it would be possible to change those as well, this adds complexity
(especially when handling MP_JOIN requests). Furthermore, the MPTCP RFC
specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
of a TCP segment at the connection level if it does not also carry a DSS
option with a Data ACK field.'
SYN/SYNACK packets do not carry a DSS option with a Data ACK field.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 26 Apr 2020 03:29:44 +0000 (20:29 -0700)]
Merge branch 'net-hns3-refactor-for-MAC-table'
Huazhong Tan says:
====================
net: hns3: refactor for MAC table
This patchset refactors the MAC table management, configure
the MAC address asynchronously, instead of synchronously.
Base on this change, it also refines the handle of promisc
mode and filter table entries restoring after reset.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns3: optimize the filter table entries handling when resetting
Currently, the PF driver removes all (including its VFs') MAC/VLAN
flow director table entries when resetting, and restores them after
reset completed.
In fact, the hardware will clear all table entries only in IMP
reset and global reset. So driver only needs to restore the table
entries in these cases, and needs do nothing when PF reset, FLR
or other function level reset.
This patch optimizes it by removing unnecessary table entries clear
and restoring handling in the reset flow, and doing the restoring
after reset completed.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns3: use mutex vport_lock instead of mutex umv_lock
Currently, the driver use mutex umv_lock to protect the variable
vport->share_umv_size. And there is already a mutex vport_lock
being defined in the driver, which is designed to protect the
resource of vport. So we can use vport_lock instead of umv_lock.
Furthermore, there is a time window for protect share_umv_size
between checking UMV space and doing MAC configuration in the
lin function hclge_add_uc_addr_common(). It should be extended.
This patch uses mutex vport_lock intead of spin lock umv_lock to
protect share_umv_size, and adjusts the mutex's range.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
As the HNS3 driver doesn't update the MAC address directly in
function hns3_set_rx_mode() now, it can't know whether the
MAC table is full from __dev_uc_sync() and __dev_mc_sync(),
so it's senseless to handle the overflow promisc here.
This patch removes the handle of overflow promisc from function
hns3_set_rx_mode(), and updates the promisc mode in the service
task.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns3: add support for dumping UC and MC MAC list
This patch adds support for dumping entries of UC and MC MAC list,
which help checking whether a MAC address being added into hardware
or not.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the HNS3 driver sync and unsync MAC address in function
hns3_set_rx_mode(). For PF, it adds and deletes MAC address directly
in the path of dev_set_rx_mode(). If failed, it won't retry until
next calling of hns3_set_rx_mode(). On the other hand, if request
add and remove a same address many times at a short interval, each
request must be done one by one, can't be merged. For VF, it sends
mailbox messages to PF to request adding or deleting MAC address in
the path of function hns3_set_rx_mode(), no matter the address is
configured success.
This patch refines it by recording the MAC address in function
hns3_set_rx_mode(), and updating MAC address in the service task.
If failed, it will retry by the next calling of periodical service
task. It also uses some state to mark the state of each MAC address
in the MAC list, which can help merge configure request for a same
address. With these changes, when global reset or IMP reset occurs,
we can restore the MAC table with the MAC list.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns3: replace num_req_vfs with num_alloc_vport in hclge_reset_umv_space()
Like the calculation elsewhere, replaces num_req_vfs with
num_alloc_vport in hclge_reset_umv_space().
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns3: remove unnecessary parameter 'is_alloc' in hclge_set_umv_space()
Since hclge_set_umv_space() is only called by hclge_init_umv_space(),
so parameter 'is_alloc' is redundant.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: hns3: refine for unicast MAC VLAN space management
Currently, firmware helps manage the unicast MAC VLAN table
space for each PF. PF just needs to tell firmware its wanted
space when initializing, and unnecessary to free it when
un-intializing. So this patch removes the umv space free handle,
and removes the forward statement of hclge_set_umv_space()
by defining hclge_init_umv_space() after it.
Signed-off-by: Jian Shen <shenjian15@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Ingo Molnar:
"Misc fixes:
- an uclamp accounting fix
- three frequency invariance fixes and a readability improvement"
* tag 'sched-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/core: Fix reset-on-fork from RT with uclamp
x86, sched: Move check for CPU type to caller function
x86, sched: Don't enable static key when starting secondary CPUs
x86, sched: Account for CPUs with less than 4 cores in freq. invariance
x86, sched: Bail out of frequency invariance if base frequency is unknown
Merge tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Ingo Molnar:
"Two changes:
- fix exit event records
- extend x86 PMU driver enumeration to add Intel Jasper Lake CPU
support"
* tag 'perf-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/core: fix parent pid/tid in task exit events
perf/x86/cstate: Add Jasper Lake CPU support
Merge tag 'objtool-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Ingo Molnar:
"Two fixes: fix an off-by-one bug, and fix 32-bit builds on 64-bit
systems"
* tag 'objtool-urgent-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Fix off-by-one in symbol_by_offset()
objtool: Fix 32bit cross builds
1) Fix memory leak in netfilter flowtable, from Roi Dayan.
2) Ref-count leaks in netrom and tipc, from Xiyu Yang.
3) Fix warning when mptcp socket is never accepted before close, from
Florian Westphal.
4) Missed locking in ovs_ct_exit(), from Tonghao Zhang.
5) Fix large delays during PTP synchornization in cxgb4, from Rahul
Lakkireddy.
6) team_mode_get() can hang, from Taehee Yoo.
7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver,
from Niklas Schnelle.
8) Fix handling of bpf XADD on BTF memory, from Jann Horn.
9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson.
10) Missing queue memory release in iwlwifi pcie code, from Johannes
Berg.
11) Fix NULL deref in macvlan device event, from Taehee Yoo.
12) Initialize lan87xx phy correctly, from Yuiko Oshino.
13) Fix looping between VRF and XFRM lookups, from David Ahern.
14) etf packet scheduler assumes all sockets are full sockets, which is
not necessarily true. From Eric Dumazet.
15) Fix mptcp data_fin handling in RX path, from Paolo Abeni.
16) fib_select_default() needs to handle nexthop objects, from David
Ahern.
17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun.
18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca.
19) Correct rx/tx stats in bcmgenet driver, from Doug Berger.
20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix
from Luke Nelson.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits)
selftests/bpf: Fix a couple of broken test_btf cases
tools/runqslower: Ensure own vmlinux.h is picked up first
bpf: Make bpf_link_fops static
bpftool: Respect the -d option in struct_ops cmd
selftests/bpf: Add test for freplace program with expected_attach_type
bpf: Propagate expected_attach_type when verifying freplace programs
bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
bpf, x86_32: Fix logic error in BPF_LDX zero-extension
bpf, x86_32: Fix clobbering of dst for BPF_JSET
bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
bpf: Fix reStructuredText markup
net: systemport: suppress warnings on failed Rx SKB allocations
net: bcmgenet: suppress warnings on failed Rx SKB allocations
macsec: avoid to set wrong mtu
mac80211: sta_info: Add lockdep condition for RCU list usage
mac80211: populate debugfs only after cfg80211 init
net: bcmgenet: correct per TX/RX ring statistics
net: meth: remove spurious copyright text
net: phy: bcm84881: clear settings on link down
chcr: Fix CPU hard lockup
...
selftests/bpf: Fix a couple of broken test_btf cases
Commit 51c39bb1d5d1 ("bpf: Introduce function-by-function verification")
introduced function linkage flag and changed the error message from
"vlen != 0" to "Invalid func linkage" and broke some fake BPF programs.
Adjust the test accordingly.
AFACT, the programs don't really need any arguments and only look
at BTF for maps, so let's drop the args altogether.
Before:
BTF raw test[103] (func (Non zero vlen)): do_test_raw:3703:FAIL expected
err_str:vlen != 0
magic: 0xeb9f
version: 1
flags: 0x0
hdr_len: 24
type_off: 0
type_len: 72
str_off: 72
str_len: 10
btf_total_size: 106
[1] INT (anon) size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[2] INT (anon) size=4 bits_offset=0 nr_bits=32 encoding=(none)
[3] FUNC_PROTO (anon) return=0 args=(1 a, 2 b)
[4] FUNC func type_id=3 Invalid func linkage
BTF libbpf test[1] (test_btf_haskv.o): libbpf: load bpf program failed:
Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Validating test_long_fname_2() func#1...
Arg#0 type PTR in test_long_fname_2() is not supported yet.
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
libbpf: -- END LOG --
libbpf: failed to load program 'dummy_tracepoint'
libbpf: failed to load object 'test_btf_haskv.o'
do_test_file:4201:FAIL bpf_object__load: -4007
BTF libbpf test[2] (test_btf_newkv.o): libbpf: load bpf program failed:
Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Validating test_long_fname_2() func#1...
Arg#0 type PTR in test_long_fname_2() is not supported yet.
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
libbpf: -- END LOG --
libbpf: failed to load program 'dummy_tracepoint'
libbpf: failed to load object 'test_btf_newkv.o'
do_test_file:4201:FAIL bpf_object__load: -4007
BTF libbpf test[3] (test_btf_nokv.o): libbpf: load bpf program failed:
Invalid argument
libbpf: -- BEGIN DUMP LOG ---
libbpf:
Validating test_long_fname_2() func#1...
Arg#0 type PTR in test_long_fname_2() is not supported yet.
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0
peak_states 0 mark_read 0
libbpf: -- END LOG --
libbpf: failed to load program 'dummy_tracepoint'
libbpf: failed to load object 'test_btf_nokv.o'
do_test_file:4201:FAIL bpf_object__load: -4007
tools/runqslower: Ensure own vmlinux.h is picked up first
Reorder include paths to ensure that runqslower sources are picking up
vmlinux.h, generated by runqslower's own Makefile. When runqslower is built
from selftests/bpf, due to current -I$(BPF_INCLUDE) -I$(OUTPUT) ordering, it
might pick up not-yet-complete vmlinux.h, generated by selftests Makefile,
which could lead to compilation errors like [0]. So ensure that -I$(OUTPUT)
goes first and rely on runqslower's Makefile own dependency chain to ensure
vmlinux.h is properly completed before source code relying on it is compiled.
selftests/bpf: Add test for freplace program with expected_attach_type
This adds a new selftest that tests the ability to attach an freplace
program to a program type that relies on the expected_attach_type of the
target program to pass verification.
bpf: Propagate expected_attach_type when verifying freplace programs
For some program types, the verifier relies on the expected_attach_type of
the program being verified in the verification process. However, for
freplace programs, the attach type was not propagated along with the
verifier ops, so the expected_attach_type would always be zero for freplace
programs.
This in turn caused the verifier to sometimes make the wrong call for
freplace programs. For all existing uses of expected_attach_type for this
purpose, the result of this was only false negatives (i.e., freplace
functions would be rejected by the verifier even though they were valid
programs for the target they were replacing). However, should a false
positive be introduced, this can lead to out-of-bounds accesses and/or
crashes.
The fix introduced in this patch is to propagate the expected_attach_type
to the freplace program during verification, and reset it after that is
done.
Luke Nelson [Wed, 22 Apr 2020 17:36:30 +0000 (10:36 -0700)]
bpf, x86_32: Fix clobbering of dst for BPF_JSET
The current JIT clobbers the destination register for BPF_JSET BPF_X
and BPF_K by using "and" and "or" instructions. This is fine when the
destination register is a temporary loaded from a register stored on
the stack but not otherwise.
This patch fixes the problem (for both BPF_K and BPF_X) by always loading
the destination register into temporaries since BPF_JSET should not
modify the destination register.
This bug may not be currently triggerable as BPF_REG_AX is the only
register not stored on the stack and the verifier uses it in a limited
way.
Fixes: 03f5781be2c7b ("bpf, x86_32: add eBPF JIT compiler for ia32") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Wang YanQing <udknight@gmail.com> Link: https://lore.kernel.org/bpf/20200422173630.8351-2-luke.r.nels@gmail.com
Luke Nelson [Wed, 22 Apr 2020 17:36:29 +0000 (10:36 -0700)]
bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
The current JIT uses the following sequence to zero-extend into the
upper 32 bits of the destination register for BPF_LDX BPF_{B,H,W},
when the destination register is not on the stack:
EMIT3(0xC7, add_1reg(0xC0, dst_hi), 0);
The problem is that C7 /0 encodes a MOV instruction that requires a 4-byte
immediate; the current code emits only 1 byte of the immediate. This
means that the first 3 bytes of the next instruction will be treated as
the rest of the immediate, breaking the stream of instructions.
This patch fixes the problem by instead emitting "xor dst_hi,dst_hi"
to clear the upper 32 bits. This fixes the problem and is more efficient
than using MOV to load a zero immediate.
This bug may not be currently triggerable as BPF_REG_AX is the only
register not stored on the stack and the verifier uses it in a limited
way, and the verifier implements a zero-extension optimization. But the
JIT should avoid emitting incorrect encodings regardless.
Fixes: 03f5781be2c7b ("bpf, x86_32: add eBPF JIT compiler for ia32") Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: Luke Nelson <luke.r.nels@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Reviewed-by: H. Peter Anvin (Intel) <hpa@zytor.com> Acked-by: Wang YanQing <udknight@gmail.com> Link: https://lore.kernel.org/bpf/20200422173630.8351-1-luke.r.nels@gmail.com
Yang Yingliang [Fri, 24 Apr 2020 08:03:48 +0000 (16:03 +0800)]
ptp: idt82p33: remove unnecessary comparison
The type of loaddr is u8 which is always '<=' 0xff, so the
loaddr <= 0xff is always true, we can remove this comparison.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Zheng Bin <zhengbin13@huawei.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: systemport: suppress warnings on failed Rx SKB allocations
The driver is designed to drop Rx packets and reclaim the buffers
when an allocation fails, and the network interface needs to safely
handle this packet loss. Therefore, an allocation failure of Rx
SKBs is relatively benign.
However, the output of the warning message occurs with a high
scheduling priority that can cause excessive jitter/latency for
other high priority processing.
This commit suppresses the warning messages to prevent scheduling
problems while retaining the failure count in the statistics of
the network interface.
Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bcmgenet: suppress warnings on failed Rx SKB allocations
The driver is designed to drop Rx packets and reclaim the buffers
when an allocation fails, and the network interface needs to safely
handle this packet loss. Therefore, an allocation failure of Rx
SKBs is relatively benign.
However, the output of the warning message occurs with a high
scheduling priority that can cause excessive jitter/latency for
other high priority processing.
This commit suppresses the warning messages to prevent scheduling
problems while retaining the failure count in the statistics of
the network interface.
Signed-off-by: Doug Berger <opendmb@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: clear phydev->suspended after soft reset
If a soft reset is triggered whilst PHY is in power-down, then
phydev->suspended will remain set. Seems we didn't face any issue yet
caused by this, but better reset the suspended flag after soft reset.
See also the following from 22.2.4.1.1
Resetting a PHY is accomplished by setting bit 0.15 to a logic one.
This action shall set the status and control registers to their default
states.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Since 6e2d85ec0559 ("net: phy: Stop with excessive soft reset")
we don't need genphy_no_soft_reset() any longer. Not setting
callback soft_reset results in a no-op now.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Move the callback into the phylink_config structure, rather than
providing a callback to set this up.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Tested-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 24 Apr 2020 23:44:55 +0000 (16:44 -0700)]
Merge branch 'qdisc-noop'
Jesper Dangaard Brouer says:
====================
Fix qdisc noop issue caused by driver and identify future bugs
I've been very puzzled why networking on my NXP development board,
using driver dpaa2-eth, stopped working when I updated the kernel
version >= 5.3. The observable issue were that interface would drop
all TX packets, because it had assigned the qdisc noop.
This turned out the be a NIC driver bug, that would only get triggered
when using sysctl net/core/default_qdisc=fq_codel. It was non-trivial
to find out[1] this was driver related. Thus, this patchset besides
fixing the driver bug, also helps end-user identify the issue.
Drivers ndo_setup_tc call should return -EOPNOTSUPP, when it cannot
support the qdisc type. Other return values will result in failing the
qdisc setup. This lead to qdisc noop getting assigned, which will
drop all TX packets on the interface.
Fixes: ab1e6de2bd49 ("dpaa2-eth: Add mqprio support") Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: report ndo_setup_tc failures via extack
Help end-users of the 'tc' command to see if the drivers ndo_setup_tc
function call fails. Troubleshooting when this happens is non-trivial
(see full process here[1]), and results in net_device getting assigned
the 'qdisc noop', which will drop all TX packets on the interface.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When a macsec interface is created, the mtu is calculated with the lower
interface's mtu value.
If the mtu of lower interface is lower than the length, which is needed
by macsec interface, macsec's mtu value will be overflowed.
So, if the lower interface's mtu is too low, macsec interface's mtu
should be set to 0.
Test commands:
ip link add dummy0 mtu 10 type dummy
ip link add macsec0 link dummy0 type macsec
ip link show macsec0
Before:
11: macsec0@dummy0: <BROADCAST,MULTICAST,M-DOWN> mtu 4294967274
After:
11: macsec0@dummy0: <BROADCAST,MULTICAST,M-DOWN> mtu 0
Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two minor fixes: one to update a Kconfig reference and the other to
fix a resource leak on an error path in sg"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: Update referenced link to cdrtools
scsi: sg: add sg_remove_request in sg_write
David S. Miller [Fri, 24 Apr 2020 22:41:51 +0000 (15:41 -0700)]
Merge branch 'mlxsw-Mirroring-cleanups'
Ido Schimmel says:
====================
mlxsw: Mirroring cleanups
This patch set contains various cleanups in SPAN (mirroring) code
noticed by Amit and I while working on future enhancements in this area.
No functional changes intended. Tested by current mirroring selftests.
Patches #1-#2 from Amit reduce nesting in a certain function and rename
a callback to a more meaningful name.
Patch #3 removes debug prints that have little value.
Patch #4 converts a reference count to 'refcount_t' in order to catch
over/under flows.
Patch #5 replaces a zero-length array with flexible-array member in
order to get a compiler warning in case the flexible array does not
occur last in the structure.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>