Saeed Mahameed [Tue, 22 Mar 2022 17:22:24 +0000 (10:22 -0700)]
net/mlx5e: Fix build warning, detected write beyond size of field
When merged with Linus' tree, the cited patch below will cause the
following build warning:
In function 'fortify_memset_chk',
inlined from 'mlx5e_xmit_xdp_frame' at drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:438:3:
include/linux/fortify-string.h:242:25: error: call to '__write_overflow_field' declared with attribute warning: detected write beyond size of field (1st parameter); maybe use struct_group()? [-Werror=attribute-warning]
242 | __write_overflow_field(p_size_field, size);
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Fix that by grouping the fields to be memset in a struct_group() to
avoid the false alarm.
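A minimal sketch of the struct_group() pattern (the field names here
are illustrative, not the actual mlx5e WQE layout):

    #include <linux/stddef.h>

    struct xdp_wqe {
            u32 opcode;
            struct_group(cleared,           /* fields reset per frame */
                    u32 ctrl;
                    u32 dseg;
            );
    };

    /* The fortified memset now targets a field of the right size: */
    memset(&wqe->cleared, 0, sizeof(wqe->cleared));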
Fixes: 552fab30a079 ("net/mlx5e: Don't prefill WQEs in XDP SQ in the multi buffer mode")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20220322172224.31849-1-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Miri Korenblit [Tue, 22 Mar 2022 15:39:43 +0000 (17:39 +0200)]
iwlwifi: mvm: Don't fail if PPAG isn't supported
When we're copying the PPAG table into the cmd structure, we're failing
if the table doesn't exist in ACPI or is invalid, or if the FW doesn't
support PPAG settings, etc.
This is wrong because those are valid scenarios. Fix this by not
failing in those cases.
Fixes: 6b77c1ef573c ("iwlwifi: acpi: move ppag code from mvm to fw/acpi")
Tested-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Acked-by: Kalle Valo <kvalo@kernel.org>
Link: https://lore.kernel.org/r/iwlwifi.20220322173828.fa47f369b717.I6a9c65149c2c3c11337f3a802dff22f514a3a436@changeid
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
When the compiler emits an endbr insn, the function address could be
different than what bpf_get_func_ip() reports. This is a short-term
workaround; bpf_get_func_ip() will be fixed later.
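For reference, on x86-64 the endbr64 instruction occupies the first 4
bytes of an IBT-enabled function, so the traced ip can sit past the
symbol address. A hypothetical sketch of the adjustment (assumes the
entry bytes are readable; is_endbr() is the x86 helper):

    #define ENDBR_INSN_SIZE 4

    static unsigned long sym_to_attach_ip(unsigned long sym_addr)
    {
            /* skip the endbr64 prefix to get the traced ip */
            if (is_endbr(*(u32 *)sym_addr))
                    return sym_addr + ENDBR_INSN_SIZE;
            return sym_addr;
    }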
Ido Schimmel [Mon, 21 Mar 2022 17:51:01 +0000 (19:51 +0200)]
selftests: forwarding: Disable learning before link up
Disable learning before bringing the bridge port up in order to avoid
the FDB being populated and the test failing.
Before:
# ./bridge_locked_port.sh
RTNETLINK answers: File exists
TEST: Locked port ipv4 [FAIL]
Ping worked after locking port, but before adding FDB entry
TEST: Locked port ipv6 [ OK ]
TEST: Locked port vlan [ OK ]
After:
# ./bridge_locked_port.sh
TEST: Locked port ipv4 [ OK ]
TEST: Locked port ipv6 [ OK ]
TEST: Locked port vlan [ OK ]
Fixes: 718b1fab90e7 ("selftests: forwarding: tests of locked port feature")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Bill Wendling [Mon, 21 Mar 2022 02:31:55 +0000 (19:31 -0700)]
bnx2x: truncate value to original sizing
The original behavior was to print out unsigned short or unsigned char
values. The change in commit 509393644864 ("bnx2x: use correct format
characters") prints out the whole value, without truncation. So truncate
the value to an unsigned {short|char} to retain the original behavior.
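For illustration (reg_read() is a hypothetical accessor, not bnx2x
code):

    u32 val = reg_read(bp);
    /* the cast restores the pre-509393644864 truncation */
    pr_info("flags: 0x%x\n", (u16)val);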
====================
net: mscc-miim: add integrated PHY reset support
The MDIO driver supports releasing the integrated PHYs from reset.
Until now, this was only implemented for the SparX-5. Add support for
the LAN966x, too.
====================
Michael Walle [Fri, 18 Mar 2022 20:13:24 +0000 (21:13 +0100)]
net: mdio: mscc-miim: add lan966x internal phy reset support
The LAN966x has two internal PHYs which are in reset by default. The
driver already supported the internal PHYs of the SparX-5. Now add
support for the LAN966x, too. Add a new compatible to distinguish them.
The LAN966x has additional control bits in this register, thus convert
the regmap_write() to regmap_update_bits() to leave the remaining bits
untouched. This doesn't change anything for the SparX-5 SoC, because
there, the register consists only of reset bits.
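Roughly, with an illustrative register and bit name:

    /* before: regmap_write() clobbers every bit in the register */
    regmap_write(map, PHY_CFG, PHY_CFG_PHY_RESET);

    /* after: only the reset bits are touched; the LAN966x-only
     * control bits in the same register are left alone */
    regmap_update_bits(map, PHY_CFG, PHY_CFG_PHY_RESET, PHY_CFG_PHY_RESET);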
Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The MDIO controller supports releasing the internal PHYs from reset by
specifying a second memory resource. This differs between the currently
supported SparX-5 and the LAN966x. Add a new compatible to distinguish
between these two.
Signed-off-by: Michael Walle <michael@walle.cc>
Acked-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: mv88e6xxx: Fill in STU support for all supported chips
Some chips using the split VTU/STU design will not accept VTU entries
whose SID points to an invalid STU entry. Therefore, mark all those
chips with either the mv88e6352_g1_stu_* or mv88e6390_g1_stu_* ops as
appropriate.
Notably, chips from the Opal Plus (6085/6097) era seem to use a
different implementation than those from Agate (6352) and onwards,
even though their external interface is the same. The former happily
accepts VTU entries referencing invalid STU entries, while the latter
does not.
This fixes an issue where the driver would fail to probe switch trees
that contained chips of the Agate/Topaz generation which did not
declare STU support, as loaded VTU entries would be read back as
invalid.
Fixes: 4c7969625cdc ("net: dsa: mv88e6xxx: Disentangle STU from VTU")
Reported-by: Marek BehĂșn <kabel@kernel.org>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Tested-by: Marek BehĂșn <kabel@kernel.org>
Link: https://lore.kernel.org/r/20220319110345.555270-1-tobias@waldekranz.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 18 Mar 2022 19:58:12 +0000 (21:58 +0200)]
net: dsa: felix: allow PHY_INTERFACE_MODE_INTERNAL on port 5
The Felix switch has 6 ports, 2 of which are internal.
Due to some misunderstanding, my initial suggestion for
vsc9959_port_modes[]:
https://patchwork.kernel.org/project/netdevbpf/patch/20220129220221.2823127-10-colin.foster@in-advantage.com/#24718277
got translated by Colin into a 5-port array, leading to an all-zero port
mode mask for port 5.
net: dsa: mv88e6xxx: Require ops be implemented to claim STU support
Simply having a physical STU table in the device doesn't do us any
good if there's no implementation of the relevant ops to access that
table. So ensure that chips that claim STU support can also talk to
the hardware.
This fixes an issue where chips that had their ->info->max_sid
set (due to their family membership), but no implementation (due to
their chip-specific ops struct) would fail to probe.
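A sketch of such a guard, assuming ops names along these lines (the
exact fields may differ from the patch):

    static bool mv88e6xxx_has_stu(struct mv88e6xxx_chip *chip)
    {
            return chip->info->max_sid > 0 &&
                   chip->info->ops->stu_getnext &&
                   chip->info->ops->stu_loadpurge;
    }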
Fixes: 4c7969625cdc ("net: dsa: mv88e6xxx: Disentangle STU from VTU")
Reported-by: Marek BehĂșn <kabel@kernel.org>
Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Marek BehĂșn <kabel@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ziyang Xuan [Sat, 19 Mar 2022 03:15:20 +0000 (11:15 +0800)]
net/tls: optimize judgement processes in tls_set_device_offload()
Setting TLS TX/RX offload via setsockopt() gives priority to HW
offload. However, whether the netdevice supports NETIF_F_HW_TLS_TX is
only checked late in the tls_set_device_offload() process, after some
memory allocations have already been done. If the netdevice does not
support NETIF_F_HW_TLS_TX, we must release that memory and return an
error, which is redundant work.
Move the NETIF_F_HW_TLS_TX judgement forward, and move the
start_marker_record and offload_ctx memory allocations back slightly.
Thus, we get a simpler exception handling process.
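Schematically, the reordered flow looks like this (simplified, most
error paths elided):

    /* check capabilities first... */
    if (!(netdev->features & NETIF_F_HW_TLS_TX)) {
            rc = -EOPNOTSUPP;
            goto release_netdev;
    }

    /* ...and only then allocate */
    start_marker_record = kmalloc(sizeof(*start_marker_record), GFP_KERNEL);
    if (!start_marker_record) {
            rc = -ENOMEM;
            goto release_netdev;
    }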
Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Yonghong Song [Sun, 20 Mar 2022 03:20:09 +0000 (20:20 -0700)]
bpftool: Fix a bug in subskeleton code generation
Compiling both the kernel and the selftests/bpf tests with clang (by
adding LLVM=1 to the build), I hit the following compilation error:
In file included from /.../tools/testing/selftests/bpf/prog_tests/subskeleton.c:6:
./test_subskeleton_lib.subskel.h:168:6: error: variable 'err' is used uninitialized whenever
'if' condition is true [-Werror,-Wsometimes-uninitialized]
if (!s->progs)
^~~~~~~~~
./test_subskeleton_lib.subskel.h:181:11: note: uninitialized use occurs here
errno = -err;
^~~
./test_subskeleton_lib.subskel.h:168:2: note: remove the 'if' if its condition is always false
if (!s->progs)
^~~~~~~~~~~~~~
The compilation error is triggered by the following code
...
int err;
err:
test_subskeleton_lib__destroy(obj);
errno = -err;
...
in test_subskeleton_lib__open(). The 'err' is not initialized, yet it
is used in 'errno = -err' later.
The fix is to remove 'errno = -err' since errno has been set properly
in all incoming branches.
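The generated epilogue then reduces to something like this (a sketch,
not verbatim bpftool output):

    err:
            test_subskeleton_lib__destroy(obj);
            /* errno was already set by the failing call above; do not
             * overwrite it with the uninitialized 'err' */
            return NULL;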
Song Liu [Mon, 21 Mar 2022 18:00:08 +0000 (11:00 -0700)]
bpf: Fix bpf_prog_pack for multi-node setup
module_alloc requires num_online_nodes * PMD_SIZE to allocate huge pages.
bpf_prog_pack uses a pack of size num_online_nodes * PMD_SIZE.
OTOH, module_alloc returns addresses that are PMD_SIZE aligned (instead of
num_online_nodes * PMD_SIZE aligned). Therefore, PMD_MASK should be used
to calculate pack_ptr in bpf_prog_pack_free().
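In other words, a sketch of the corrected calculation:

    /* hdr points somewhere inside a pack; module_alloc() only
     * guarantees PMD_SIZE alignment, so mask with PMD_MASK instead
     * of assuming pack-size alignment */
    pack_ptr = (void *)((unsigned long)hdr & PMD_MASK);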
Fixes: 1ca2b2ac2c72 ("bpf: Select proper size for bpf_prog_pack")
Reported-by: syzbot+c946805b5ce6ab87df0b@syzkaller.appspotmail.com
Signed-off-by: Song Liu <song@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220321180009.1944482-2-song@kernel.org
net: Revert the softirq will run annotation in ____napi_schedule().
The lockdep annotation lockdep_assert_softirq_will_run() expects that
either hard or soft interrupts are disabled, because both guarantee that
the "raised" soft-interrupts will be processed once the context is left.
This triggers in flush_smp_call_function_from_idle(), but in this case
it explicitly calls do_softirq() if there are pending softirqs.
Revert the "softirq will run" annotation in ____napi_schedule() and move
the check back to __netif_rx() as it was. Keep the IRQ-off assert in
____napi_schedule() because this is always required.
Fixes: dd30f625f3f88 ("net: Add lockdep asserts to ____napi_schedule().")
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com>
Link: https://lore.kernel.org/r/YjhD3ZKWysyw8rc6@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Mon, 21 Mar 2022 14:11:38 +0000 (14:11 +0000)]
Merge branch 'devlink-locking'
Jakub Kicinski says:
====================
devlink: hold the instance lock in eswitch callbacks
Series number 2 in the effort to hold the devlink instance lock
when calling driver callbacks. We have the following drivers using
this API:
- bnxt, nfp, netdevsim - their own locking is removed / simplified
by this series; all of them needed a lock to protect from changes
to the number of VFs while switching modes, now the VF config bus
callback takes the devlink instance lock via devl_lock();
- ice - appears not to allow changing modes while SR-IOV enabled,
so nothing to do there;
- liquidio - does not contain any locking;
- octeontx2/af - is very special but at least doesn't have locking
so doesn't get in the way either;
- mlx5 has a wealth of locks - I chickened out and dropped the lock
in the callbacks so that I can leave the driver be, for now.
The last one is obviously not ideal, but I would prefer to transition
the API already, as it may take longer.
v2: use a wrapper in mlx5 and extend the comment
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 18 Mar 2022 19:23:44 +0000 (12:23 -0700)]
devlink: hold the instance lock during eswitch_mode callbacks
Make the devlink core hold the instance lock during eswitch_mode
callbacks. Cheat in case of mlx5 (see the cover letter).
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 18 Mar 2022 19:23:42 +0000 (12:23 -0700)]
netdevsim: replace port_list_lock with devlink instance lock
Take advantage of the devlink instance lock for protecting
the port list. This will simplify locking even more once
all devlink callbacks hold the instance lock.
We need to add locking in nsim_dev_port_add_all() which used
to assume higher layer protection when accessing the list.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 18 Mar 2022 19:23:40 +0000 (12:23 -0700)]
bnxt: use the devlink instance lock to protect sriov
In prep for .eswitch_mode_set being called with the devlink instance
lock held use that lock explicitly instead of creating a local mutex
just for the sriov reconfig.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jiri Olsa [Mon, 21 Mar 2022 07:01:13 +0000 (08:01 +0100)]
bpf: Fix kprobe_multi return probe backtrace
Andrii reported that backtraces from kprobe_multi program attached
as return probes are not complete and showing just initial entry [1].
It's caused by changing the registers to use the original function ip
address as the instruction pointer even for the return probe, which
breaks the backtrace taken from the return probe.
This change keeps the registers intact and stores the original entry ip
and link address on the stack in the bpf_kprobe_multi_run_ctx struct,
where the bpf_get_func_ip and bpf_get_attach_cookie helpers for
kprobe_multi programs can find them.
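A sketch of the run context along the lines described above:

    struct bpf_kprobe_multi_run_ctx {
            struct bpf_run_ctx run_ctx;     /* embeds the generic run ctx */
            struct bpf_kprobe_multi_link *link;
            unsigned long entry_ip;         /* original function entry ip */
    };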
Hangbin Liu [Mon, 21 Mar 2022 02:41:49 +0000 (10:41 +0800)]
selftests/bpf/test_lirc_mode2.sh: Exit with proper code
When test_lirc_mode2_user fails to exec, the test reports failure but
still exits with 0. Fix it by exiting with an error code. Another issue
is the LIRCDEV check: with the shell's -n test, we need to quote the
variable, or the test will always be true. So if test_lirc_mode2_user
was not run, just exit with the skip code.
Casper Andersson [Mon, 21 Mar 2022 10:14:45 +0000 (11:14 +0100)]
net: sparx5: Add arbiter for managing PGID table
The PGID (Port Group ID) table holds port masks for different purposes.
The first 72 are reserved for port destination masks, flood masks, and
CPU forwarding. The rest are shared between multicast, link aggregation,
and virtualization profiles. The GLAG area is reserved to not be used by
anything else, since it is a subset of the MCAST area.
The arbiter keeps track of which entries are in use. You can ask for a
free ID or give back one you are done using.
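The arbiter interface could look roughly like this (hypothetical
signatures):

    /* hand out a free index from the shared area, or -EBUSY if full */
    int sparx5_pgid_alloc_mcast(struct sparx5 *sparx5, u16 *idx);

    /* return an index once its port mask is no longer needed */
    int sparx5_pgid_free(struct sparx5 *sparx5, u16 idx);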
Signed-off-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 21 Mar 2022 13:21:17 +0000 (13:21 +0000)]
Merge branch 'nfp3800'
Simon Horman says:
====================
nfp: support for NFP-3800
Yinjun Zhang says:
This is the second of a two part series to support the NFP-3800 device.
To utilize the new hardware features of the NFP-3800, the driver adds
support for a new data path, NFDK. This series mainly does some
refactoring work on the data path related implementations. The data path
specific implementations are now separated into nfd3 and nfdk
directories respectively, and the common part is moved into a new file.
* The series starts with a small refinement in Patch 1/10. Patches 2/10
and 3/10 are the main refactoring of the data path implementation, which
prepares for adding the NFDK data path.
* Before the introduction of NFDK, there's some more preparation work
for NFP-3800 features, such as multi-descriptor per-packet and write-back
mechanism of TX pointer, which is done in patches 4/10, 5/10, 6/10, 7/10.
* Patch 8/10 allows the driver to select data path according
to firmware version. Finally, patches 9/10 and 10/10 introduce the new
NFDK data path.
Changes between v1 and v2
* Correct kdoc for nfp_nfdk_tx()
* Correct build warnings on 32-bit
Thanks to everyone who contributed to this work.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yinjun Zhang [Mon, 21 Mar 2022 10:42:09 +0000 (11:42 +0100)]
nfp: nfdk: implement xdp tx path for NFDK
Due to the different definition of txbuf in NFDK compared to NFD3,
there are no pre-allocated txbufs for XDP use in NFDK's implementation;
we just use the existing rxbuf and recycle it when the XDP TX is
completed.
For each packet to transmit in the XDP path, we cannot use more than
`NFDK_TX_DESC_PER_SIMPLE_PKT` txbufs: one is to stash the virtual
address, and another is for the DMA address, so currently the amount of
transmitted bytes is not accumulated. We also borrow the last bit of the
virtual address to indicate a newly transmitted packet, relying on the
address's alignment.
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:08 +0000 (11:42 +0100)]
nfp: add support for NFDK data path
Add a new data path. The TX side is completely different: each packet
has multiple descriptor entries (between 2 and 32). The TX ring is
divided into blocks of 32 descriptors, and descriptors of one packet
can't cross block bounds. The RX side is the same for now.
ABI version 5 or later is required. There is no support for VLAN
insertion on TX. The XDP_TX action and AF_XDP zero-copy are not
implemented in the NFDK path.
Changes to Jakub's work:
* Move the hw_csum_tx statistics update after jumbo packet
segmentation.
* Set the L3_CSUM flag to enable recalculation of the L3 header checksum
in the ipv4 case.
* Mark TSO of a packet with metadata prepended as unsupported.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Xingfeng Hu <xingfeng.hu@corigine.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Dianchao Wang <dianchao.wang@corigine.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:07 +0000 (11:42 +0100)]
nfp: choose data path based on version
Prepare for choosing data path based on the firmware version field.
Exploit one bit from the reserved byte in the firmware version field
as the data path type. We need the firmware version right after the
vNIC is allocated, so it has to be read inside nfp_net_alloc();
callers don't have to set it afterwards.
Following patches will bring the implementation of the second data
path.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:06 +0000 (11:42 +0100)]
nfp: add per-data path feature mask
Make sure that features supported only by some of the data paths
are not enabled for all. Add a mask of supported features into
the data path op structure.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:05 +0000 (11:42 +0100)]
nfp: use TX ring pointer write back
Newer versions of the PCIe microcode support writing the position of
the TX pointer back into host memory. This speeds
up TX completions, because we avoid a read from device memory
(replacing PCIe read with DMA coherent read).
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:04 +0000 (11:42 +0100)]
nfp: move tx_ring->qcidx into cold data
QCidx is not used on the fast path, so move it to the lower cacheline.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:03 +0000 (11:42 +0100)]
nfp: prepare for multi-part descriptors
New datapaths may use multiple descriptor units to describe a single
packet. Prepare for that by adding a descriptors-per-simple-frame
constant into the ring size calculations.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:02 +0000 (11:42 +0100)]
nfp: use callbacks for slow path ring related functions
To reduce the coupling of slow path ring implementations and their
callers, use callbacks instead.
Changes to Jakub's work:
* Also use callbacks for xmit functions
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:01 +0000 (11:42 +0100)]
nfp: move the fast path code to separate files
In preparation for support for a new datapath format, move all ring and
fast path logic into separate files. It is basically a verbatim move,
with some wrapping functions; no new structures or functions are added.
The current data path is called NFD3, after the initial version of the
driver ABI it used. The non-fast-path but ring-related functions are
moved to the nfp_net_dp.c file.
Changes to Jakub's work:
* Rebase on xsk related code.
* Split the patch, move the callback changes to next commit.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Mon, 21 Mar 2022 10:42:00 +0000 (11:42 +0100)]
nfp: calculate ring masks without conditionals
Ring enable masks are 64bit long. Replace mask calculation from:
block_cnt == 64 ? 0xffffffffffffffffULL : (1 << block_cnt) - 1
with:
(U64_MAX >> (64 - block_cnt))
to simplify the code.
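A self-contained check of the identity (valid for 1 <= block_cnt <= 64):

    #include <stdint.h>

    static uint64_t ring_mask(unsigned int block_cnt)
    {
            /* block_cnt == 64 yields UINT64_MAX with no special case;
             * block_cnt == 1 yields 0x1 */
            return UINT64_MAX >> (64 - block_cnt);
    }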
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Fei Qin <fei.qin@corigine.com>
Signed-off-by: Simon Horman <simon.horman@corigine.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next.
This patchset contains updates for the nf_tables register tracking
infrastructure, a fix to disable a bogus warning when attaching ct
helpers, one namespace pollution fix, and a few cleanups for the
flowtable.
1) Revisit the conntrack gc routine to reduce the chances of overrunning
the netlink buffer from the event path. From Florian Westphal.
2) Disable warning on explicit ct helper assignment, from Phil Sutter.
3) Read-only expressions do not update registers; mark them as
NFT_REDUCE_READONLY. Add helper functions to update the register
tracking information. This patch re-enables the register tracking
infrastructure.
4) Cancel register tracking in case an expression fully/partially
clobbers existing data.
5) Add register tracking support for remaining expressions: ct,
lookup, meta, numgen, osf, hash, immediate, socket, xfrm, tunnel,
fib, exthdr.
6) Rename init and exit functions for the conntrack h323 helper,
from Randy Dunlap.
7) Remove redundant field in struct flow_offload_work.
8) Update nf_flow_table_iterate() to pass flowtable to callback.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Casper Andersson [Fri, 18 Mar 2022 12:53:31 +0000 (13:53 +0100)]
net: sparx5: Use vid 1 when bridge default vid 0 to avoid collision
Standalone ports use vid 0. Let the bridge use vid 1 when
"vlan_default_pvid 0" is set, to avoid collisions. Since no VLAN is
created when the default pvid is 0, this is set at "PORT_ATTR_SET" and
handled in the switchdev fdb handler.
Signed-off-by: Casper Andersson <casper.casan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Wang Yufen [Fri, 18 Mar 2022 06:35:08 +0000 (14:35 +0800)]
netlabel: fix out-of-bounds memory accesses
In calipso_map_cat_ntoh(), in the for loop, if the return value of
netlbl_bitmap_walk() is equal to (net_clen_bits - 1), then when
netlbl_bitmap_walk() is called the next time, an out-of-bounds memory
access of bitmap[byte_offset] occurs.
The bug was found during fuzzing. The following is the fuzzing report:
BUG: KASAN: slab-out-of-bounds in netlbl_bitmap_walk+0x3c/0xd0
Read of size 1 at addr ffffff8107bf6f70 by task err_OH/252
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Yufen <wangyufen@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
bpf: Check for NULL return from bpf_get_btf_vmlinux
When CONFIG_DEBUG_INFO_BTF is disabled, bpf_get_btf_vmlinux can return a
NULL pointer. Check for it in btf_get_module_btf to prevent a NULL pointer
dereference.
While kernel test robot only complained about this specific case, let's
also check for NULL in other call sites of bpf_get_btf_vmlinux.
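The pattern, sketched (the exact error handling may differ per call
site):

    struct btf *btf = bpf_get_btf_vmlinux();

    /* NULL when CONFIG_DEBUG_INFO_BTF is disabled,
     * ERR_PTR on other failures */
    if (IS_ERR_OR_NULL(btf))
            return ERR_PTR(-EINVAL);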
Namhyung Kim [Mon, 14 Mar 2022 18:20:42 +0000 (11:20 -0700)]
selftests/bpf: Test skipping stacktrace
Add a test case for stacktrace with skip > 0 using a small-sized
buffer. Previously, it didn't support skipping entries greater than or
equal to the size of the buffer, and it filled the skipped part with 0.
Let's say that the caller has storage for num_elem stack frames. Then,
the BPF stack helper functions walk the stack for only num_elem frames.
This means that if skip > 0, one keeps only 'num_elem - skip' frames.
This is because it sets init_nr in the perf_callchain_entry to the end
of the buffer to save num_elem entries only. I believe it was because
the perf callchain code unwound the stack frames until it reached the
global max size (sysctl_perf_event_max_stack).
However, it now has perf_callchain_entry_ctx.max_stack to limit the
iteration locally. This simplifies the code to handle init_nr in the
BPF callstack entries and removes the confusion with the perf_event's
__PERF_SAMPLE_CALLCHAIN_EARLY which sets init_nr to 0.
Also change the comment on bpf_get_stack() in the header file to be
more explicit what the return value means.
Song Liu [Fri, 11 Mar 2022 20:11:35 +0000 (12:11 -0800)]
bpf: Select proper size for bpf_prog_pack
Using HPAGE_PMD_SIZE as the size for bpf_prog_pack is not ideal in some
cases. Specifically, for NUMA systems, __vmalloc_node_range requires
PMD_SIZE * num_online_nodes() to allocate huge pages. Also, if the system
does not support huge pages (i.e., with cmdline option nohugevmalloc), it
is better to use PAGE_SIZE packs.
Add logic to select proper size for bpf_prog_pack. This solution is not
ideal, as it makes assumption about the behavior of module_alloc and
__vmalloc_node_range. However, it appears to be the easiest solution as
it doesn't require changes in module_alloc and vmalloc code.
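A sketch of such a probe, close in spirit to (but not necessarily
identical with) the patch:

    static size_t select_bpf_prog_pack_size(void)
    {
            size_t size = PMD_SIZE * num_online_nodes();
            void *ptr = module_alloc(size);

            /* no huge pages available? fall back to 4k packs */
            if (!ptr || !is_vm_area_hugepages(ptr))
                    size = PAGE_SIZE;
            vfree(ptr);

            return size;
    }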
Merge branch 'Make 2-byte access to bpf_sk_lookup->remote_port endian-agnostic'
Jakub Sitnicki says:
====================
This patch set is a result of a discussion we had around the RFC patchset from
Ilya [1]. The fix for the narrow loads from the RFC series is still relevant,
but this series does not depend on it. Nor is it required to unbreak sk_lookup
tests on BE, if this series gets applied.
To summarize the takeaways from [1]:
1) we want to make 2-byte load from ctx->remote_port portable across LE and BE,
2) we keep the 4-byte load from ctx->remote_port as it is today - the
result varies with the endianness of the platform.
Jakub Sitnicki [Sat, 19 Mar 2022 18:33:56 +0000 (19:33 +0100)]
selftests/bpf: Fix test for 4-byte load from remote_port on big-endian
The context access converter rewrites the 4-byte load from
bpf_sk_lookup->remote_port to a 2-byte load from bpf_sk_lookup_kern
structure.
It means that we cannot treat the destination register contents as a 32-bit
value, or the code will not be portable across big- and little-endian
architectures.
This is exactly the same case as with 4-byte loads from bpf_sock->dst_port
so follow the approach outlined in [1] and treat the register contents as a
16-bit value in the test.
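In BPF C, the portable 2-byte access then looks like this (a sketch;
the port number is illustrative):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("sk_lookup")
    int check_remote_port(struct bpf_sk_lookup *ctx)
    {
            /* 2-byte load, same result on LE and BE */
            __u16 port = ctx->remote_port;

            return port == bpf_htons(8008) ? SK_PASS : SK_DROP;
    }

    char _license[] SEC("license") = "GPL";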
Fixes: 306ecb352f97 ("selftests/bpf: Cover 4-byte load from remote_port in bpf_sk_lookup")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-4-jakub@cloudflare.com
Above addresses map to ->remote_ip6 field, which precedes ->remote_port,
and are populated during the bpf_sk_lookup IPv6 tests.
Unsurprisingly, on s390x we observe:
#136/38 sk_lookup/narrow access to ctx v4:OK
#136/39 sk_lookup/narrow access to ctx v6:FAIL
Fix it by removing the checks for 1-byte loads from offsets outside of the
->remote_port field.
Fixes: 9f98f541ae0b ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide")
Suggested-by: Ilya Leoshkevich <iii@linux.ibm.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-3-jakub@cloudflare.com
Jakub Sitnicki [Sat, 19 Mar 2022 18:33:54 +0000 (19:33 +0100)]
bpf: Treat bpf_sk_lookup remote_port as a 2-byte field
In commit 9f98f541ae0b ("bpf: Make remote_port field in struct
bpf_sk_lookup 16-bit wide") the remote_port field has been split up and
re-declared from u32 to be16.
However, the accompanying changes to the context access converter have
not been well thought through when it comes to big-endian platforms.
Today 2-byte wide loads from offsetof(struct bpf_sk_lookup, remote_port)
are handled as narrow loads from a 4-byte wide field.
This by itself is not enough to create a problem, but when we combine
1. 32-bit wide access to ->remote_port backed by a 16-bit wide load, with
2. the inherent difference between little- and big-endian in how narrow
loads have to be handled (see bpf_ctx_narrow_access_offset),
we get inconsistent results for 2-byte loads from &ctx->remote_port on
LE and BE architectures. This in turn makes BPF C code for the common
case of a 2-byte load from ctx->remote_port not portable.
To rectify it, inform the context access converter that remote_port is
a 2-byte wide field, and that only 1-byte loads need to be treated as
narrow loads.
At the same time, we special-case the 4-byte load from &ctx->remote_port
to continue handling it the same way as we do today, in order to keep
the existing BPF programs working.
Fixes: 9f98f541ae0b ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20220319183356.233666-2-jakub@cloudflare.com
Merge branch 'Enable non-atomic allocations in local storage'
Joanne Koong says:
====================
From: Joanne Koong <joannelkoong@gmail.com>
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patchset, sleepable programs can allocate memory in local
storage using GFP_KERNEL, while non-sleepable programs always default to
GFP_ATOMIC.
v3 <- v2:
* Add extra case to local_storage.c selftest to test associating multiple
elements with the local storage, which triggers a GFP_KERNEL allocation in
local_storage_update().
* Cast gfp_t to __s32 in verifier to fix the sparse warnings
v2 <- v1:
* Allocate the memory before/after the raw_spin_lock_irqsave, depending
on the gfp flags
* Rename mem_flags to gfp_flags
* Reword the comment "*mem_flags* is set by the bpf verifier" to
"*gfp_flags* is a hidden argument provided by the verifier"
* Add a sentence to the commit message about existing local storage
selftests covering both the GFP_ATOMIC and GFP_KERNEL paths in
bpf_local_storage_update.
====================
Joanne Koong [Fri, 18 Mar 2022 04:55:52 +0000 (21:55 -0700)]
bpf: Enable non-atomic allocations in local storage
Currently, local storage memory can only be allocated atomically
(GFP_ATOMIC). This restriction is too strict for sleepable bpf
programs.
In this patch, the verifier detects whether the program is sleepable,
and passes the corresponding GFP_KERNEL or GFP_ATOMIC flag as a
5th argument to bpf_task/sk/inode_storage_get. This flag will propagate
down to the local storage functions that allocate memory.
Please note that bpf_task/sk/inode_storage_update_elem functions are
invoked by userspace applications through syscalls. Preemption is
disabled before bpf_task/sk/inode_storage_update_elem is called, which
means they will always have to allocate memory atomically.
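Conceptually (a sketch, not the exact verifier rewrite):

    /* hidden 5th argument chosen by the verifier */
    gfp_t gfp_flags = prog->aux->sleepable ? GFP_KERNEL : GFP_ATOMIC;

    sdata = bpf_local_storage_update(owner, smap, value,
                                     BPF_NOEXIST, gfp_flags);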
Eliminate anonymous module_init() and module_exit(), which can lead to
confusion or ambiguity when reading System.map, crashes/oops/bugs,
or an initcall_debug log.
Give each of these init and exit functions unique driver-specific
names to eliminate the anonymous names.
Check if the destination register already contains the data that this
tunnel expression performs. This allows us to skip this redundant
operation.
If the destination contains a different selector, update the register
tracking information. This patch does not perform bitwise tracking.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check if the destination register already contains the data that this
xfrm expression performs. This allows us to skip this redundant
operation.
If the destination contains a different selector, update the register
tracking information.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check if the destination register already contains the data that this
socket expression performs. This allows us to skip this redundant
operation. If the destination contains a different selector, update the
register tracking information.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check if the destination register already contains the data that this
osf expression performs. Always cancel register tracking for jhash,
since this requires tracking multiple source registers in case of
concatenations. Perform register tracking (without bitwise) for symhash,
since its input does not come from a source register.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Check if the destination register already contains the data that this ct
expression performs. This allows us to skip this redundant operation. If
the destination contains a different selector, update the register
tracking information.
Export nft_expr_reduce_bitwise as a symbol since nft_ct might be
compiled as a module.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: cancel tracking for clobbered destination registers
Output of expressions might be larger than one single register, which
might clobber existing data. Reset tracking for all destination
registers that are needed to store the expression output.
This patch adds three new helper functions:
- nft_reg_track_update: cancel previous register tracking and update it.
- nft_reg_track_cancel: cancel any previous register tracking info.
- __nft_reg_track_cancel: cancel only one single register tracking info.
Partial register clobbering detection is also supported by checking the
.num_reg field, which describes the number of registers that are used.
This patch updates the following expressions:
- meta_bridge
- bitwise
- byteorder
- meta
- payload
to use these helper functions.
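In an expression's reduce callback, the bookkeeping then becomes
(sketch):

    /* record that dreg now holds this expression's output... */
    nft_reg_track_update(track, expr, priv->dreg, priv->len);

    /* ...or give up on tracking when the output is not predictable */
    nft_reg_track_cancel(track, priv->dreg, priv->len);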
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: do not reduce read-only expressions
Skip register tracking for expressions that perform read-only operations
on the registers. Define and use a cookie pointer NFT_REDUCE_READONLY to
avoid defining stubs for these expressions.
This patch re-enables register tracking which was disabled in e7fc6a7a2ee6
("netfilter: nf_tables: disable register tracking"). Follow up patches
add remaining register tracking for existing expressions.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Florian Westphal [Wed, 16 Feb 2022 15:43:05 +0000 (16:43 +0100)]
netfilter: conntrack: revisit gc autotuning
As of commit 737686ed8aff
("netfilter: conntrack: collect all entries in one cycle"),
conntrack gc was changed to run every 2 minutes.
On systems where the conntrack hash table is set to a large value, most
evictions happen from the gc worker rather than the packet path, due to
the hash table distribution.
This causes netlink event overflows when events are collected.
This change collects the average expiry of scanned entries and
reschedules to the average remaining value, within a 1 to 60 second
interval.
To avoid event overflows, reschedule after each bucket and add a
limit for both run time and number of evictions per run.
If more entries have to be evicted, reschedule and restart 1 jiffy
into the future.
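Schematically (avg_remaining is an illustrative unsigned long, not a
name from the patch):

    /* re-arm the gc worker based on the observed average expiry,
     * clamped to the [1s, 60s] window */
    unsigned long next_run = clamp(avg_remaining, 1UL * HZ, 60UL * HZ);

    queue_delayed_work(system_power_efficient_wq, &gc_work->dwork,
                       next_run);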
Align it with helpers like bpf_find_btf_id, so that all functions
returning BTF in an out parameter follow the same rule of raising a
reference consistently, regardless of module or vmlinux BTF.
Adjust existing callers to handle the change accordingly.