====================
dt-bindings: net: convert sff,sfp to dtschema
This patch set converts the sff,sfp to dtschema.
The first patch does a somewhat mechanical conversion without changing
anything else beside the format in which the dt binding is presented.
In the second patch we rename some dt nodes to be generic. The last two
patches change the GPIO related properties so that they uses the -gpios
preferred suffix. This way, all the DTBs are passing the validation
against the sff,sfp.yaml binding.
====================
arch: arm64: dts: marvell: rename the sfp GPIO properties
Rename the GPIO related sfp properties to include the preffered -gpios
suffix. Also, with this change the dtb_check will no longer complain
when trying to verify the DTS against the sff,sfp.yaml binding.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
arch: arm64: dts: lx2160a-clearfog-itx: rename the sfp GPIO properties
Rename the 'mod-def0-gpio' property to 'mod-def0-gpios' so that we use
the preferred -gpios suffix. Also, with this change the dtb_check will
not complain when trying to verify the DTS against the sff,sfp.yaml
binding.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Following the commit 4bdc70d2af38 "devlink: hold the instance lock
during eswitch_mode callbacks" which takes devlink instance lock for all
devlink eswitch callbacks and adds a temporary workaround, this patchset
removes the workaround, replaces devlink API functions by devl_ API
where called from mlx5 driver eswitch callbacks flows and adds devlink
instance lock in other driver's path that leads to these functions.
While moving to devl_ API the patchset removes part of the devlink API
functions which mlx5 was the last one to use and so not used by any
driver now.
The patchset also remove DEVLINK_NL_FLAG_NO_LOCK flag from the callbacks
of port_new/port which are called only from mlx5 driver and the already
locked by the patchset as parallel paths to the eswitch callbacks using
devl_ API functions.
This patchset will be followed by another patchset that will remove
DEVLINK_NL_FLAG_NO_LOCK flag from devlink reload and devlink health
callbacks. Thus we will have all devlink callbacks locked and it will
pave the way to remove devlink mutex.
===================
devlink: Hold the instance lock in port_new / port_del callbacks
Let the core take the devlink instance lock around port_new and port_del
callbacks and remove the now redundant locking in the only driver that
currently use them.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net/mlx5: Remove devl_unlock from mlx5_devlink_eswitch_mode_set
The callback mlx5_devlink_eswitch_mode_set() had unlocked devlink as a
temporary workaround once devlink instance lock was added to devlink
eswitch callbacks. Now that all flows triggered by this function
that took devlink lock are using devl_ API and all parallel paths are
locked we can remove this workaround.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net/mlx5: Use devl_ API in mlx5e_devlink_port_register
As part of the flows invoked by mlx5_devlink_eswitch_mode_set() get to
mlx5_rescan_drivers_locked() which can call mlx5e_probe()/mlx5e_remove
and register/unregister mlx5e driver ports accordingly. This can lead to
deadlock once mlx5_devlink_eswitch_mode_set() will use devlink lock.
Use devl_port_register/unregister() instead of
devlink_port_register/unregister() and add devlink instance locks in the
driver paths to this function to have it locked while calling devl_ API
function.
If remove or probe were called by module init or module cleanup flows,
need to lock devlink just before calling devl_port_register(), otherwise
it is called by attach/detach or register/unregister flow and we can
have the flow locked. Added flag to distinguish between these cases.
This will be used by the downstream patch to invoke
mlx5_devlink_eswitch_mode_set() with devlink locked.
The previous patch removed the last usage of the functions
devlink_rate_leaf_create() and devlink_rate_nodes_destroy(). Thus,
remove these function from devlink API.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net/mlx5: Use devl_ API in mlx5_esw_devlink_sf_port_register
The function mlx5_esw_devlink_sf_port_register() calls
devlink_port_register() and devlink_rate_leaf_create(). Use devl_ API to
call devl_port_register() and devl_rate_leaf_create() accordingly and
add devlink instance lock in driver paths to this function.
Similarly, use devl_ API to call devl_port_unregister() and
devl_rate_leaf_destroy() in mlx5_esw_devlink_sf_port_unregister() and
ensure locking devlink instance lock on all the paths to this function
too.
This will be used by the downstream patch to invoke
mlx5_devlink_eswitch_mode_set() with devlink lock held.
Note this patch is taking devlink lock on mlx5_devlink_sf_port_new/del()
which are devlink callbacks for port_new/del(). We will take these locks
off once these callbacks will be locked by devlink too.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net/mlx5: Use devl_ API in mlx5_esw_offloads_devlink_port_register
The function mlx5_esw_offloads_devlink_port_register() calls
devlink_port_register() and devlink_rate_leaf_create(). Use devl_ API to
call devl_port_register() and devl_rate_leaf_create() accordingly and
add devlink instance lock in driver paths to this function.
Similarly, use devl_ API to call devl_port_unregister() and
devl_rate_leaf_destroy() in mlx5_esw_offloads_devlink_port_unregister()
and ensure locking devlink instance lock on the paths to this function
too.
This will be used by the downstream patch to invoke
mlx5_devlink_eswitch_mode_set() with devlink lock held.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Use devl_rate_nodes_destroy() instead of devlink_rate_nodes_destroy().
Add devlink instance lock in the driver paths to this function to have
it locked while calling devl_ API function.
This will be used by the downstream patch to invoke
mlx5_devlink_eswitch_mode_set() with devlink lock held.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
net/mlx5: Remove devl_unlock from mlx5_eswtich_mode_callback_enter
The function mlx5_eswtich_mode_callback_enter() was added as a temporary
workaround once devlink instance lock was added to devlink eswitch
callbacks. However, code review and testing show that all the callbacks
part to eswitch_mode_set don't take devlink instance lock in any flow
and so unlocking devlink instance lock while entering these functions is
not needed.
Remove devl_lock from mlx5_eswtich_mode_callback_enter() and devl_unlock
from mlx5_eswtich_mode_callback_exit(). Also remove the functions
mlx5_eswtich_mode_callback_enter()/exit() as they are not needed any
more. The callback eswitch_mode_set will be treated separately in the
following patches.
Signed-off-by: Moshe Shemesh <moshe@nvidia.com> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
When building with Clang we encounter the following warning:
| drivers/net/ethernet/amd/xgbe/xgbe-dcb.c:234:42: error: format specifies
| type 'unsigned char' but the argument has type '__u16' (aka 'unsigned
| short') [-Werror,-Wformat] pfc->pfc_cap, pfc->pfc_en, pfc->mbc,
| pfc->delay);
pfc->pfc_cap , pfc->pfc_cn, pfc->mbc are all of type `u8` while pfc->delay is
of type `u16`. The correct format specifiers `%hh[u|x]` were used for
the first three but not for pfc->delay, which is causing the warning
above.
Variadic functions (printf-like) undergo default argument promotion.
Documentation/core-api/printk-formats.rst specifically recommends using
the promoted-to-type's format flag. In this case `%d` (or `%x` to
maintain hex representation) should be used since both u8's and u16's
are fully representable by an int.
Moreover, C11 6.3.1.1 states:
(https://www.open-std.org/jtc1/sc22/wg14/www/docs/n1548.pdf) `If an int
can represent all values of the original type ..., the value is
converted to an int; otherwise, it is converted to an unsigned int.
These are called the integer promotions.`
Jakub Kicinski [Sat, 9 Jul 2022 02:52:53 +0000 (19:52 -0700)]
tls: rx: add counter for NoPad violations
As discussed with Maxim add a counter for true NoPad violations.
This should help deployments catch unexpected padded records vs
just control records which always need re-encryption.
bcm63xx: fix Tx cleanup when NAPI poll budget is zero
NAPI poll() function may be passed a budget value of zero, i.e. during
netpoll, which isn't NAPI context.
Therefore, napi_consume_skb() must be given budget value instead of
!force to truly discern netpoll-like scenarios.
octeontx2-af: Wrapper functions for MAC addr add/del/update/reset
These functions are wrappers for mac add/addr/del/update in
exact match table. These will be invoked from mbox handler routines
if exact matct table is supported and enabled.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
octeontx2: Modify mbox request and response structures
Exact match table modification requires wider fields as it has
more number of slots to fill in. Modifying an entry in exact match
table may cause hash collision and may be required to delete entry
from 4-way 2K table and add to fully associative 32 entry CAM table.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NPC exact match table installs drop on hit rules in
NPC mcam for each channel. This rule has broadcast and multicast
bits cleared. Exact match bit cleared and channel bits
set. If exact match table hit bit is 0, corresponding NPC mcam
drop rule will be hit for the packet and will be dropped.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CN10KB silicon supports Exact match feature. This feature can be disabled
through devlink configuration. Devlink command fails if DMAC filter rules
are already present. Once disabled, legacy RPM based DMAC filters will be
configured.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CN10KB silicon supports exact match table. Scanning KEX
profile should check for exact match feature is enabled
and then set profile masks properly.
These kex profile masks are required to configure NPC
MCAM drop rules. If there is a miss in exact match table,
these drop rules will drop those packets.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CN10KB silicon has support for exact match table. This table
can be used to match maimum 64 bit value of KPU parsed output.
Hit/non hit in exact match table can be used as a KEX key to
NPC mcam.
This patch makes use of Exact match table to increase number of
DMAC filters supported. NPC mcam is no more need for each of these
DMAC entries as will be populated in Exact match table.
This patch implements following
1. Initialization of exact match table only for CN10KB.
2. Add/del/update interface function for exact match table.
Signed-off-by: Ratheesh Kannoth <rkannoth@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
CN10KB variant of CN10K series of silicons supports
a new feature where in a large protocol field
(eg 128bit IPv6 DIP) can be condensed into a small
hashed 32bit data. This saves a lot of space in MCAM key
and allows user to add more protocol fields into the filter.
A max of two such protocol data can be hashed.
This patch adds support for hashing IPv6 SIP and/or DIP.
If we set XFRM security policy by calling setsockopt with option
IPV6_XFRM_POLICY, the policy will be stored in 'sock_policy' in 'sock'
struct. However tcp_v6_send_response doesn't look up dst_entry with the
actual socket but looks up with tcp control socket. This may cause a
problem that a RST packet is sent without ESP encryption & peer's TCP
socket can't receive it.
This patch will make the function look up dest_entry with actual socket,
if the socket has XFRM policy(sock_policy), so that the TCP response
packet via this function can be encrypted, & aligned on the encrypted
TCP socket.
Tested: We encountered this problem when a TCP socket which is encrypted
in ESP transport mode encryption, receives challenge ACK at SYN_SENT
state. After receiving challenge ACK, TCP needs to send RST to
establish the socket at next SYN try. But the RST was not encrypted &
peer TCP socket still remains on ESTABLISHED state.
So we verified this with test step as below.
[Test step]
1. Making a TCP state mismatch between client(IDLE) & server(ESTABLISHED).
2. Client tries a new connection on the same TCP ports(src & dst).
3. Server will return challenge ACK instead of SYN,ACK.
4. Client will send RST to server to clear the SOCKET.
5. Client will retransmit SYN to server on the same TCP ports.
[Expected result]
The TCP connection should be established.
Cc: Maciej Żenczykowski <maze@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Sehee Lee <seheele@google.com> Signed-off-by: Sewook Seo <sewookseo@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
We've added 94 non-merge commits during the last 19 day(s) which contain
a total of 125 files changed, 5141 insertions(+), 6701 deletions(-).
The main changes are:
1) Add new way for performing BTF type queries to BPF, from Daniel Müller.
2) Add inlining of calls to bpf_loop() helper when its function callback is
statically known, from Eduard Zingerman.
3) Implement BPF TCP CC framework usability improvements, from Jörn-Thorben Hinz.
4) Add LSM flavor for attaching per-cgroup BPF programs to existing LSM
hooks, from Stanislav Fomichev.
5) Remove all deprecated libbpf APIs in prep for 1.0 release, from Andrii Nakryiko.
6) Add benchmarks around local_storage to BPF selftests, from Dave Marchevsky.
7) AF_XDP sample removal (given move to libxdp) and various improvements around AF_XDP
selftests, from Magnus Karlsson & Maciej Fijalkowski.
8) Add bpftool improvements for memcg probing and bash completion, from Quentin Monnet.
9) Add arm64 JIT support for BPF-2-BPF coupled with tail calls, from Jakub Sitnicki.
10) Sockmap optimizations around throughput of UDP transmissions which have been
improved by 61%, from Cong Wang.
11) Rework perf's BPF prologue code to remove deprecated functions, from Jiri Olsa.
12) Fix sockmap teardown path to avoid sleepable sk_psock_stop, from John Fastabend.
13) Fix libbpf's cleanup around legacy kprobe/uprobe on error case, from Chuang Wang.
14) Fix libbpf's bpf_helpers.h to work with gcc for the case of its sec/pragma
macro, from James Hilliard.
15) Fix libbpf's pt_regs macros for riscv to use a0 for RC register, from Yixun Lan.
16) Fix bpftool to show the name of type BPF_OBJ_LINK, from Yafang Shao.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (94 commits)
selftests/bpf: Fix xdp_synproxy build failure if CONFIG_NF_CONNTRACK=m/n
bpf: Correctly propagate errors up from bpf_core_composites_match
libbpf: Disable SEC pragma macro on GCC
bpf: Check attach_func_proto more carefully in check_return_code
selftests/bpf: Add test involving restrict type qualifier
bpftool: Add support for KIND_RESTRICT to gen min_core_btf command
MAINTAINERS: Add entry for AF_XDP selftests files
selftests, xsk: Rename AF_XDP testing app
bpf, docs: Remove deprecated xsk libbpf APIs description
selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usage
libbpf, riscv: Use a0 for RC register
libbpf: Remove unnecessary usdt_rel_ip assignments
selftests/bpf: Fix few more compiler warnings
selftests/bpf: Fix bogus uninitialized variable warning
bpftool: Remove zlib feature test from Makefile
libbpf: Cleanup the legacy uprobe_event on failed add/attach_event()
libbpf: Fix wrong variable used in perf_event_uprobe_open_legacy()
libbpf: Cleanup the legacy kprobe_event on failed add/attach_event()
selftests/bpf: Add type match test against kernel's task_struct
selftests/bpf: Add nested type to type based tests
...
====================
If there is a MAC address specified in the device tree, then
use it. This is already perfectly legal to specify in accordance
with the generic ethernet-controller.yaml schema.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Fri, 8 Jul 2022 16:28:58 +0000 (16:28 +0000)]
af_unix: fix unix_sysctl_register() error path
We want to kfree(table) if @table has been kmalloced,
ie for non initial network namespace.
Fixes: cfdf57906d9a ("af_unix: Do not call kmemdup() for init_net's sysctl table.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
mptcp: Self test improvements and a header tweak
Patch 1 moves a definition to a header so it can be used in a struct
declaration.
Patch 2 adjusts a time threshold for a selftest that runs much slower on
debug kernels (and even more on slow CI infrastructure), to reduce
spurious failures.
Patches 3 & 4 improve userspace PM test coverage.
Patches 5 & 6 clean up output from a test script and selftest helper
tool.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The usage header of pm_nl_ctl command doesn't match with the context. So
this patch adds the missing userspace PM keywords 'ann', 'rem', 'csf',
'dsf', 'events' and 'listen' in it.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests: mptcp: avoid Terminated messages in userspace_pm
There're some 'Terminated' messages in the output of userspace pm tests
script after killing './pm_nl_ctl events' processes:
Created network namespaces ns1, ns2 [OK]
./userspace_pm.sh: line 166: 13735 Terminated ip netns exec "$ns2" ./pm_nl_ctl events >> "$client_evts" 2>&1
./userspace_pm.sh: line 172: 13737 Terminated ip netns exec "$ns1" ./pm_nl_ctl events >> "$server_evts" 2>&1
Established IPv4 MPTCP Connection ns2 => ns1 [OK]
./userspace_pm.sh: line 166: 13753 Terminated ip netns exec "$ns2" ./pm_nl_ctl events >> "$client_evts" 2>&1
./userspace_pm.sh: line 172: 13755 Terminated ip netns exec "$ns1" ./pm_nl_ctl events >> "$server_evts" 2>&1
Established IPv6 MPTCP Connection ns2 => ns1 [OK]
ADD_ADDR 10.0.2.2 (ns2) => ns1, invalid token [OK]
This patch adds a helper kill_wait(), in it using 'wait $pid 2>/dev/null'
commands after 'kill $pid' to avoid printing out these Terminated messages.
Use this helper instead of using 'kill $pid'.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace pm subflow tests support for mptcp_join.sh
script. Add userspace pm create subflow and destroy test cases in
userspace_tests().
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds userspace pm tests support for mptcp_join.sh script. Add
userspace pm add_addr and rm_addr test cases in userspace_tests().
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 8 Jul 2022 17:14:09 +0000 (10:14 -0700)]
selftests: mptcp: tweak simult_flows for debug kernels
The mentioned test measures the transfer run-time to verify
that the user-space program is able to use the full aggregate B/W.
Even on (virtual) link-speed-bound tests, debug kernel can slow
down the transfer enough to cause sporadic test failures.
Instead of unconditionally raising the maximum allowed run-time,
tweak when the running kernel is a debug one, and use some simple/
rough heuristic to guess such scenarios.
Note: this intentionally avoids looking for /boot/config-<version> as
the latter file is not always available in our reference CI
environments.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Move macro MPTCPOPT_HMAC_LEN definition from net/mptcp/protocol.h to
include/net/mptcp.h.
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When we are operating in SGMII inband mode, it implies that there is a
PHY connected, and the ethtool advertisement for autoneg applies to
the PHY, not the SGMII link. When in 1000base-X mode, then this applies
to the 802.3z link and needs to be applied to the PCS.
Antoine Tenart [Thu, 7 Jul 2022 08:02:45 +0000 (10:02 +0200)]
Documentation: add a description for net.core.high_order_alloc_disable
A description is missing for the net.core.high_order_alloc_disable
option in admin-guide/sysctl/net.rst ; add it. The above sysctl option
was introduced by commit 8973faec3a91 ("net: add high_order_alloc_disable
sysctl/static key").
Thanks to Eric for running again the benchmark cited in the above
commit, showing this knob is now mostly of historical importance.
When building with Clang we encounter this warning:
| net/rxrpc/rxkad.c:434:33: error: format specifies type 'unsigned short'
| but the argument has type 'u32' (aka 'unsigned int') [-Werror,-Wformat]
| _leave(" = %d [set %hx]", ret, y);
y is a u32 but the format specifier is `%hx`. Going from unsigned int to
short int results in a loss of data. This is surely not intended
behavior. If it is intended, the warning should be suppressed through
other means.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
====================
tls: pad strparser, internal header, decrypt_ctx etc.
A grab bag of non-functional refactoring to make the series
which will let us decrypt into a fresh skb smaller.
Patches in this series are not strictly required to get the
decryption into a fresh skb going, they are more in the "things
which had been annoying me for a while" category.
====================
Jakub Kicinski [Fri, 8 Jul 2022 01:03:13 +0000 (18:03 -0700)]
tls: create an internal header
include/net/tls.h is getting a little long, and is probably hard
for driver authors to navigate. Split out the internals into a
header which will live under net/tls/. While at it move some
static inlines with a single user into the source files, add
a few tls_ prefixes and fix spelling of 'proccess'.
Jakub Kicinski [Fri, 8 Jul 2022 01:03:11 +0000 (18:03 -0700)]
tls: rx: wrap decrypt params in a struct
The max size of iv + aad + tail is 22B. That's smaller
than a single sg entry (32B). Don't bother with the
memory packing, just create a struct which holds the
max size of those members.
Jakub Kicinski [Fri, 8 Jul 2022 01:03:10 +0000 (18:03 -0700)]
tls: rx: always allocate max possible aad size for decrypt
AAD size is either 5 or 13. Really no point complicating
the code for the 8B of difference. This will also let us
turn the chunked up buffer into a sane struct.
Jakub Kicinski [Fri, 8 Jul 2022 01:03:09 +0000 (18:03 -0700)]
strparser: pad sk_skb_cb to avoid straddling cachelines
sk_skb_cb lives within skb->cb[]. skb->cb[] straddles
2 cache lines, each containing 24B of data.
The first cache line does not contain much interesting
information for users of strparser, so pad things a little.
Previously strp_msg->full_len would live in the first cache
line and strp_msg->offset in the second.
We need to reorder the 8 byte temp_reg with struct tls_msg
to prevent a 4B hole which would push the struct over 48B.
selftests/bpf: Fix xdp_synproxy build failure if CONFIG_NF_CONNTRACK=m/n
When CONFIG_NF_CONNTRACK=m, struct bpf_ct_opts and enum member
BPF_F_CURRENT_NETNS are not exposed. This commit allows building the
xdp_synproxy selftest in such cases. Note that nf_conntrack must be
loaded before running the test if it's compiled as a module.
This commit also allows this selftest to be successfully compiled when
CONFIG_NF_CONNTRACK is disabled.
One unused local variable of type struct bpf_ct_opts is also removed.
Daniel Müller [Thu, 7 Jul 2022 21:19:31 +0000 (21:19 +0000)]
bpf: Correctly propagate errors up from bpf_core_composites_match
This change addresses a comment made earlier [0] about a missing return
of an error when __bpf_core_types_match is invoked from
bpf_core_composites_match, which could have let to us erroneously
ignoring errors.
Regarding the typedef name check pointed out in the same context, it is
not actually an issue, because callers of the function perform a name
check for the root type anyway. To make that more obvious, let's add
comments to the function (similar to what we have for
bpf_core_types_are_compat, which is called in pretty much the same
context).
James Hilliard [Wed, 6 Jul 2022 11:18:38 +0000 (05:18 -0600)]
libbpf: Disable SEC pragma macro on GCC
It seems the gcc preprocessor breaks with pragmas when surrounding
__attribute__.
Disable these pragmas on GCC due to upstream bugs see:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55578
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90400
Fixes errors like:
error: expected identifier or '(' before '#pragma'
106 | SEC("cgroup/bind6")
| ^~~
error: expected '=', ',', ';', 'asm' or '__attribute__' before '#pragma'
114 | char _license[] SEC("license") = "GPL";
| ^~~
Because we don't enforce expected_attach_type for XDP programs,
we end up in hitting 'if (prog->expected_attach_type == BPF_LSM_CGROUP'
part in check_return_code and follow up with testing
`prog->aux->attach_func_proto->type`, but `prog->aux->attach_func_proto`
is NULL.
Add explicit prog_type check for the "Note, BPF_LSM_CGROUP that
attach ..." condition. Also, don't skip return code check for
LSM/STRUCT_OPS.
The above actually brings an issue with existing selftest which
tries to return EPERM from void inet_csk_clone. Fix the
test (and move called_socket_clone to make sure it's not
incremented in case of an error) and add a new one to explicitly
verify this condition.
Fixes: 6719214cf03c ("bpf: per-cgroup lsm flavor") Reported-by: syzbot+5cc0730bd4b4d2c5f152@syzkaller.appspotmail.com Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220708175000.2603078-1-sdf@google.com
net: ag71xx: switch to napi_build_skb() to reuse skbuff_heads
napi_build_skb() reuses NAPI skbuff_head cache in order to save some
cycles on freeing/allocating skbuff_heads on every new Rx or completed
Tx.
Use napi_consume_skb() to feed the cache with skbuff_heads of completed
Tx, so it's never empty. The budget parameter is added to indicate NAPI
context, as a value of zero can be passed in the case of netpoll.
Signed-off-by: Sieng-Piaw Liew <liew.s.piaw@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Current implementation is such that driver first resets the
existing PFC config before applying new pfc configuration.
This creates a problem like once PF or VFs requests PFC config
previous pfc config by other PFVfs is getting reset.
This patch fixes the problem by removing unnecessary resetting
of PFC config. Also configure Pause quanta value to smaller as
current value is too high.
Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Daniel Müller [Wed, 6 Jul 2022 21:28:55 +0000 (21:28 +0000)]
selftests/bpf: Add test involving restrict type qualifier
This change adds a type based test involving the restrict type qualifier
to the BPF selftests. On the btfgen path, this will verify that bpftool
correctly handles the corresponding RESTRICT BTF kind.
Daniel Müller [Wed, 6 Jul 2022 21:28:54 +0000 (21:28 +0000)]
bpftool: Add support for KIND_RESTRICT to gen min_core_btf command
This change adjusts bpftool's type marking logic, as used in conjunction
with TYPE_EXISTS relocations, to correctly recognize and handle the
RESTRICT BTF kind.
Lukas reported that after commit f20ccdf36230 ("libbpf: move xsk.{c,h}
into selftests/bpf") MAINTAINERS file needed an update.
In the meantime, Magnus removed AF_XDP samples in commit 4aa67e7d13dc
("bpf, samples: Remove AF_XDP samples"), but selftests part still misses
its entry in MAINTAINERS.
Now that xdpxceiver became xskxceiver, tools/testing/selftests/bpf/*xsk*
will match all of the files related to AF_XDP testing (test_xsk.sh,
xskxceiver, xsk_prereqs.sh, xsk.{c,h}).
Recently, xsk part of libbpf was moved to selftests/bpf directory and
lives on its own because there is an AF_XDP testing application that
needs it called xdpxceiver. That name makes it a bit hard to indicate
who maintains it as there are other XDP samples in there, whereas this
one is strictly about AF_XDP.
Do s/xdpxceiver/xskxceiver so that it will be easier to figure out who
maintains it. A follow-up patch will correct MAINTAINERS file.
When building with Clang we encounter the following warnings:
| net/l2tp/l2tp_debugfs.c:187:40: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] seq_printf(m, " nr %hu, ns %hu\n", session->nr,
| session->ns);
-
| net/l2tp/l2tp_debugfs.c:196:32: error: format specifies type 'unsigned
| short' but the argument has type 'int' [-Werror,-Wformat]
| session->l2specific_type, l2tp_get_l2specific_len(session));
-
| net/l2tp/l2tp_debugfs.c:219:6: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] session->nr, session->ns,
Both session->nr and ->nc are of type `u32`. The currently used format
specifier is `%hu` which describes a `u16`. My proposed fix is to listen
to Clang and use the correct format specifier `%u`.
For the warning at line 196, l2tp_get_l2specific_len() returns an int
and should therefore be using the `%d` format specifier.
Link: https://github.com/ClangBuiltLinux/linux/issues/378 Signed-off-by: Justin Stitt <justinstitt@google.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 8 Jul 2022 03:01:53 +0000 (20:01 -0700)]
Merge branch 'polarfire-soc-macb-reset-support'
Conor Dooley says:
====================
PolarFire SoC macb reset support
The Cadence MACBs on PolarFire SoC (MPFS) have reset capability and are
compatible with the zynqmp's init function. I have removed the zynqmp
specific comments from that function & renamed it to reflect what it
does, since it is no longer zynqmp only.
MPFS's MACB had previously used the generic binding, so I also added
the required specific binding.
For v2, I noticed some low hanging cleanup fruit so there are extra
patches added for that:
moving the init function out of the config structs, aligning the
alignment of the zynqmp & default config structs with the other dozen
or so structs & simplifing the error paths to use dev_err_probe().
Feel free to apply as many or as few of those as you like.
====================
net: macb: sort init_reset_optional() with other init()s
init_reset_optional() is somewhat oddly placed amidst the macb_config
struct definitions. Move it to a more reasonable location alongside
the fu540 init functions.
To date, the Microchip PolarFire SoC (MPFS) has been using the
cdns,macb compatible, however the generic device does not have reset
support. Add a new compatible & .data for MPFS to hook into the reset
functionality added for zynqmp support (and make the zynqmp init
function generic in the process).
Until now the PolarFire SoC (MPFS) has been using the generic
"cdns,macb" compatible but has optional reset support. Add a specific
compatible which falls back to the currently used generic binding.
Acked-by: Rob Herring <robh@kernel.org> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When building with clang we encounter this warning:
| net/l2tp/l2tp_ppp.c:1557:6: error: format specifies type 'unsigned
| short' but the argument has type 'u32' (aka 'unsigned int')
| [-Werror,-Wformat] session->nr, session->ns,
Both session->nr and session->ns are of type u32. The format specifier
previously used is `%hu` which would truncate our unsigned integer from
32 to 16 bits. This doesn't seem like intended behavior, if it is then
perhaps we need to consider suppressing the warning with pragma clauses.
This patch should get us closer to the goal of enabling the -Wformat
flag for Clang builds.
Jie Wang [Tue, 5 Jul 2022 11:35:15 +0000 (19:35 +0800)]
net: page_pool: optimize page pool page allocation in NUMA scenario
Currently NIC packet receiving performance based on page pool deteriorates
occasionally. To analysis the causes of this problem page allocation stats
are collected. Here are the stats when NIC rx performance deteriorates:
The rx_pp_alloc_waive count indicates that a large number of pages' numa
node are inconsistent with the NIC device numa node. Therefore these pages
can't be reused by the page pool. As a result, many new pages would be
allocated by __page_pool_alloc_pages_slow which is time consuming. This
causes the NIC rx performance fluctuations.
The main reason of huge numa mismatch pages in page pool is that page pool
uses alloc_pages_bulk_array to allocate original pages. This function is
not suitable for page allocation in NUMA scenario. So this patch uses
alloc_pages_bulk_array_node which has a NUMA id input parameter to ensure
the NUMA consistent between NIC device and allocated pages.
Repeated NIC rx performance tests are performed 40 times. NIC rx bandwidth
is higher and more stable compared to the datas above. Here are three test
stats, the rx_pp_alloc_waive count is zero and rx_pp_alloc_slow which
indicates pages allocated from slow patch is relatively low.
- eth: ibmvnic: properly dispose of all skbs during a failover
Previous releases - always broken:
- bpf:
- fix insufficient bounds propagation from
adjust_scalar_min_max_vals
- clear page contiguity bit when unmapping pool
- netfilter: nft_set_pipapo: release elements in clone from
abort path
- mptcp: netlink: issue MP_PRIO signals from userspace PMs
- can:
- rcar_canfd: fix data transmission failed on R-Car V3U
- gs_usb: gs_usb_open/close(): fix memory leak
Misc:
- add Wenjia as SMC maintainer"
* tag 'net-5.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (57 commits)
wireguard: Kconfig: select CRYPTO_CHACHA_S390
crypto: s390 - do not depend on CRYPTO_HW for SIMD implementations
wireguard: selftests: use microvm on x86
wireguard: selftests: always call kernel makefile
wireguard: selftests: use virt machine on m68k
wireguard: selftests: set fake real time in init
r8169: fix accessing unset transport header
net: rose: fix UAF bug caused by rose_t0timer_expiry
usbnet: fix memory leak in error case
Revert "tls: rx: move counting TlsDecryptErrors for sync"
mptcp: update MIB_RMSUBFLOW in cmd_sf_destroy
mptcp: fix local endpoint accounting
selftests: mptcp: userspace PM support for MP_PRIO signals
mptcp: netlink: issue MP_PRIO signals from userspace PMs
mptcp: Acquire the subflow socket lock before modifying MP_PRIO flags
mptcp: Avoid acquiring PM lock for subflow priority changes
mptcp: fix locking in mptcp_nl_cmd_sf_destroy()
net/mlx5e: Fix matchall police parameters validation
net/sched: act_police: allow 'continue' action offload
net: lan966x: hardcode the number of external ports
...
Merge tag 'pinctrl-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
- Tag Intel pin control as supported in MAINTAINERS
- Fix a NULL pointer exception in the Aspeed driver
- Correct some NAND functions in the Sunxi A83T driver
- Use the right offset for some Sunxi pins
- Fix a zero base offset in the Freescale (NXP) i.MX93
- Fix the IRQ support in the STM32 driver
* tag 'pinctrl-v5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: stm32: fix optional IRQ support to gpios
pinctrl: imx: Add the zero base flag for imx93
pinctrl: sunxi: sunxi_pconf_set: use correct offset
pinctrl: sunxi: a83t: Fix NAND function name for some pins
pinctrl: aspeed: Fix potential NULL dereference in aspeed_pinmux_set_mux()
MAINTAINERS: Update Intel pin control to Supported
These are indeed "should not happen" situations, but it turns out recent
changes made the 'task_is_stopped_or_trace()' case trigger (fix for that
exists, is pending more testing), and the BUG_ON() makes it
unnecessarily hard to actually debug for no good reason.
It's been that way for a long time, but let's make it clear: BUG_ON() is
not good for debugging, and should never be used in situations where you
could just say "this shouldn't happen, but we can continue".
Use WARN_ON_ONCE() instead to make sure it gets logged, and then just
continue running. Instead of making the system basically unusuable
because you crashed the machine while potentially holding some very core
locks (eg this function is commonly called while holding 'tasklist_lock'
for writing).
Dave Marchevsky [Tue, 5 Jul 2022 19:00:18 +0000 (12:00 -0700)]
selftests/bpf: Add benchmark for local_storage RCU Tasks Trace usage
This benchmark measures grace period latency and kthread cpu usage of
RCU Tasks Trace when many processes are creating/deleting BPF
local_storage. Intent here is to quantify improvement on these metrics
after Paul's recent RCU Tasks patches [0].
Specifically, fork 15k tasks which call a bpf prog that creates/destroys
task local_storage and sleep in a loop, resulting in many
call_rcu_tasks_trace calls.
To determine grace period latency, trace time elapsed between
rcu_tasks_trace_pregp_step and rcu_tasks_trace_postgp; for cpu usage
look at rcu_task_trace_kthread's stime in /proc/PID/stat.
On my virtualized test environment (Skylake, 8 cpus) benchmark results
demonstrate significant improvement:
BEFORE Paul's patches:
SUMMARY tasks_trace grace period latency avg 22298.551 us stddev 1302.165 us
SUMMARY ticks per tasks_trace grace period avg 2.291 stddev 0.324
AFTER Paul's patches:
SUMMARY tasks_trace grace period latency avg 16969.197 us stddev 2525.053 us
SUMMARY ticks per tasks_trace grace period avg 1.146 stddev 0.178
Note that since these patches are not in bpf-next benchmarking was done
by cherry-picking this patch onto rcu tree.
Yixun Lan [Wed, 6 Jul 2022 14:02:04 +0000 (22:02 +0800)]
libbpf, riscv: Use a0 for RC register
According to the RISC-V calling convention register usage here [0], a0
is used as return value register, so rename it to make it consistent
with the spec.
Commit 4dcc54ae1a06 ("Merge branch 'af_unix-per-netns-socket-hash'") and
commit 83c5e0e6debf ("af_unix: Put pathname sockets in the global hash
table.") changed a hash table layout.
Now, while looking up sockets, we traverse the global table for the
pathname sockets and the first half of each per-netns hash table for
abstract sockets, where pathname sockets are also linked. Thus, the
more pathname sockets we have, the longer we take to look up abstract
sockets. This characteristic has been there before the layout change,
but we can improve it now.
This patch changes the per-netns hash table's layout so that sockets not
requiring lookup reside in the first half and do not impact the lookup of
abstract sockets.
We have run a test that bind()s 100,000 abstract/pathname sockets for
each, bind()s an abstract socket 100,000 times and measures the time
on __unix_find_socket_byname(). The result shows that the patch makes
each lookup faster.
Jakub Kicinski [Thu, 7 Jul 2022 03:04:09 +0000 (20:04 -0700)]
Merge branch 'wireguard-patches-for-5-19-rc6'
Jason A. Donenfeld says:
====================
wireguard patches for 5.19-rc6
1) A few small fixups to the selftests, per usual. Of particular note is
a fix for a test flake that occurred on especially fast systems that
boot in less than a second.
2) An addition during this cycle of some s390 crypto interacted with the
way wireguard selects dependencies, resulting in linker errors
reported by the kernel test robot. So Vladis sent in a patch for
that, which also required a small preparatory fix moving some Kconfig
symbols around.
====================
Select the new implementation of CHACHA20 for S390 when available.
It is faster than the generic software implementation, but also prevents
some linker errors in certain situations.
crypto: s390 - do not depend on CRYPTO_HW for SIMD implementations
Various accelerated software implementation Kconfig values for S390 were
mistakenly placed into drivers/crypto/Kconfig, even though they're
mainly just SIMD code and live in arch/s390/crypto/ like usual. This
gives them the very unusual dependency on CRYPTO_HW, which leads to
problems elsewhere.
This patch fixes the issue by moving the Kconfig values for non-hardware
drivers into the usual place in crypto/Kconfig.
Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
These selftests are used for much more extensive changes than just the
wireguard source files. So always call the kernel's build file, which
will do something or nothing after checking the whole tree, per usual.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>