Eric Dumazet [Fri, 7 Jan 2022 18:39:53 +0000 (10:39 -0800)]
af_packet: fix tracking issues in packet_do_bind()
It appears that my changes in packet_do_bind() were
slightly wrong.
syzbot found that calling bind() twice would trigger
a false positive.
Remove proto_curr/dev_curr variables and rewrite things
to be less confusing (like not having to use netdev_tracker_alloc(),
and instead use the standard dev_hold_track())
Fixes: 5f1b96644de5 ("net: add net device refcount tracker to struct packet_type") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Link: https://lore.kernel.org/r/20220107183953.3886647-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
mptcp: Refactoring for one selftest and csum validation
Patch 1 changes the MPTCP join self tests to depend more on events
rather than delays, so the script runs faster and has more consistent
results.
Patches 2 and 3 get rid of some duplicate code in MPTCP's checksum
validation by modifying and leveraging an existing helper function.
====================
Paolo Abeni [Fri, 7 Jan 2022 19:25:22 +0000 (11:25 -0800)]
selftests: mptcp: more stable join tests-cases
MPTCP join self-tests are a bit fragile as they reply on
delays instead of events to catch-up with the expected
sockets states.
Replace the delay with state checking where possible and
reduce the number of sleeps in the most complex scenarios.
This will both reduce the tests run-time and will improve
stability.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 7 Jan 2022 14:42:29 +0000 (16:42 +0200)]
net: dsa: felix: add port fast age support
Add support for flushing the MAC table on a given port in the ocelot
switch library, and use this functionality in the felix DSA driver.
This operation is needed when a port leaves a bridge to become
standalone, and when the learning is disabled, and when the STP state
changes to a state where no FDB entry should be present.
Vladimir Oltean [Fri, 7 Jan 2022 16:43:32 +0000 (18:43 +0200)]
net: mscc: ocelot: fix incorrect balancing with down LAG ports
Assuming the test setup described here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210205130240.4072854-1-vladimir.oltean@nxp.com/
(swp1 and swp2 are in bond0, and bond0 is in a bridge with swp0)
it can be seen that when swp1 goes down (on either board A or B), then
traffic that should go through that port isn't forwarded anywhere.
In other words, in the "bad" configuration, the attempt is to remove the
inactive swp1 from the destination ports via PGID_DST. But when a MAC
table entry is learned, it is learned towards PGID_DST 1, because that
is the logical port id of the LAG itself (it is equal to the lowest
numbered member port). So when swp1 becomes inactive, if we set
PGID_DST[1] to contain just swp1 and not swp2, the packet will not have
any chance to reach the destination via swp2.
The "correct" way to remove swp1 as a destination is via PGID_AGGR
(remove swp1 from the aggregation port groups for all aggregation
codes). This means that PGID_DST[1] and PGID_DST[2] must still contain
both swp1 and swp2. This makes the MAC table still treat packets
destined towards the single-port LAG as "multicast", and the inactive
ports are removed via the aggregation code tables.
The change presented here is a design one: the ocelot_get_bond_mask()
function used to take an "only_active_ports" argument. We don't need
that. The only call site that specifies only_active_ports=true,
ocelot_set_aggr_pgids(), must retrieve the entire bonding mask, because
it must program that into PGID_DST. Additionally, it must also clear the
inactive ports from the bond mask here, which it can't do if bond_mask
just contains the active ports:
ac = ocelot_read_rix(ocelot, ANA_PGID_PGID, i);
ac &= ~bond_mask; <---- here
/* Don't do division by zero if there was no active
* port. Just make all aggregation codes zero.
*/
if (num_active_ports)
ac |= BIT(aggr_idx[i % num_active_ports]);
ocelot_write_rix(ocelot, ac, ANA_PGID_PGID, i);
So it becomes the responsibility of ocelot_set_aggr_pgids() to take
ocelot_port->lag_tx_active into consideration when populating the
aggr_idx array.
Jakub Kicinski [Sat, 8 Jan 2022 02:51:46 +0000 (18:51 -0800)]
Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
40GbE Intel Wired LAN Driver Updates 2022-01-07
This series contains updates to i40e and iavf drivers.
Karen limits per VF MAC filters so that one VF does not consume all
filters for i40e.
Jedrzej reduces busy wait time for admin queue calls for i40e.
Mateusz updates firmware versions to reflect new supported NVM images
and renames an error to remove non-inclusive language for i40e.
Yang Li fixes a set but not used warning for i40e.
Jason Wang removes an unneeded variable for iavf.
* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
iavf: remove an unneeded variable
i40e: remove variables set but not used
i40e: Remove non-inclusive language
i40e: Update FW API version
i40e: Minimize amount of busy-waiting during AQ send
i40e: Add ensurance of MacVlan resources for every trusted VF
====================
Gal Pressman [Sun, 2 Jan 2022 08:12:53 +0000 (10:12 +0200)]
net/tls: Fix skb memory leak when running kTLS traffic
The cited Fixes commit introduced a memory leak when running kTLS
traffic (with/without hardware offloads).
I'm running nginx on the server side and wrk on the client side and get
the following:
I'm not familiar with these areas of the code, but I've added this
sk_defer_free_flush() to tls_sw_recvmsg() based on a hunch and it
resolved the issue.
Fixes: 4ae7ff61863f ("tcp: defer skb freeing after socket lock is released") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220102081253.9123-1-gal@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Wang [Sun, 12 Dec 2021 08:10:01 +0000 (16:10 +0800)]
iavf: remove an unneeded variable
The variable `ret_code' used for returning is never changed in function
`iavf_shutdown_adminq'. So that it can be removed and just return its
initial value 0 at the end of `iavf_shutdown_adminq' function.
Signed-off-by: Jason Wang <wangborong@cdjrlc.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Yang Li [Mon, 13 Dec 2021 03:11:07 +0000 (11:11 +0800)]
i40e: remove variables set but not used
The code that uses variables pe_cntx_size and pe_filt_size
has been removed, so they should be removed as well.
Eliminate the following clang warnings:
drivers/net/ethernet/intel/i40e/i40e_common.c:4139:20:
warning: variable 'pe_filt_size' set but not used.
drivers/net/ethernet/intel/i40e/i40e_common.c:4139:6:
warning: variable 'pe_cntx_size' set but not used.
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Tested-by: Gurucharan G <gurucharanx.g@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
i40e: Minimize amount of busy-waiting during AQ send
The i40e_asq_send_command will now use a non blocking usleep_range if
possible (non-atomic context), instead of busy-waiting udelay. The
usleep_range function uses hrtimers to provide better performance and
removes the negative impact of busy-waiting in time-critical
environments.
1. Rename i40e_asq_send_command to i40e_asq_send_command_atomic
and add 5th parameter to inform if called from an atomic context.
Call inside usleep_range (if non-atomic) or udelay (if atomic).
2. Change i40e_asq_send_command to invoke
i40e_asq_send_command_atomic(..., false).
3. Change two functions:
- i40e_aq_set_vsi_uc_promisc_on_vlan
- i40e_aq_set_vsi_mc_promisc_on_vlan
to explicitly use i40e_asq_send_command_atomic(..., true)
instead of i40e_asq_send_command, as they use spinlocks and do some
work in an atomic context.
All other calls to i40e_asq_send_command remain unchanged.
Signed-off-by: Dawid Lukwinski <dawid.lukwinski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Karen Sornek [Thu, 17 Jun 2021 07:19:26 +0000 (09:19 +0200)]
i40e: Add ensurance of MacVlan resources for every trusted VF
Trusted VF can use up every resource available, leaving nothing
to other trusted VFs.
Introduce define, which calculates MacVlan resources available based
on maximum available MacVlan resources, bare minimum for each VF and
number of currently allocated VFs.
Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Karen Sornek <karen.sornek@intel.com> Tested-by: Tony Brelinski <tony.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kevin Bracey [Thu, 6 Jan 2022 21:56:37 +0000 (23:56 +0200)]
sch_cake: revise Diffserv docs
Documentation incorrectly stated that CS1 is equivalent to LE for
diffserv8. But when LE was added to the table, CS1 was pushed into tin
1, leaving only LE in tin 0.
Also "TOS1" no longer exists, as that is the same codepoint as LE.
Make other tweaks properly distinguishing codepoints from classes and
putting current Diffserve codepoints ahead of legacy ones.
David S. Miller [Fri, 7 Jan 2022 11:27:07 +0000 (11:27 +0000)]
Merge branch 'mptcp-next'
Mat Martineau says:
====================
mptcp: New features and cleanup
These patches have been tested in the MPTCP tree for a longer than usual
time (thanks to holiday schedules), and are ready for the net-next
branch. Changes include feature updates, small fixes, refactoring, and
some selftest changes.
Patch 1 fixes an OUTQ ioctl issue with TCP fallback sockets.
Patches 2, 3, and 6 add support of the MPTCP fastclose option (quick
shutdown of the full MPTCP connection, similar to TCP RST in regular
TCP), and a related self test.
Patch 4 cleans up some accept and poll code that is no longer needed
after the fastclose changes.
Patch 5 add userspace disconnect using AF_UNSPEC, which is used when
testing fastclose and makes the MPTCP socket's handling of AF_UNSPEC in
connect() more TCP-like.
Patches 7-11 refactor subflow creation to make better use of multiple
local endpoints and to better handle individual connection failures when
creating multiple subflows. Includes self test updates.
Patch 12 cleans up the way subflows are added to the MPTCP connection
list, eliminating the need for calls throughout the MPTCP code that had
to check the intermediate "join list" for entries to shift over to the
main "connection list".
Patch 13 refactors the MPTCP release_cb flags to use separate storage
for values only accessed with the socket lock held (no atomic ops
needed), and for values that need atomic operations.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:26 +0000 (16:20 -0800)]
mptcp: avoid atomic bit manipulation when possible
Currently the msk->flags bitmask carries both state for the
mptcp_release_cb() - mostly touched under the mptcp data lock
- and others state info touched even outside such lock scope.
As a consequence, msk->flags is always manipulated with
atomic operations.
This change splits such bitmask in two separate fields, so
that we use plain bit operations when touching the
cb-related info.
The MPTCP_PUSH_PENDING bit needs additional care, as it is the
only CB related field currently accessed either under the mptcp
data lock or the mptcp socket lock.
Let's add another mask just for such bit's sake.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:25 +0000 (16:20 -0800)]
mptcp: cleanup MPJ subflow list handling
We can simplify the join list handling leveraging the
mptcp_release_cb(): if we can acquire the msk socket
lock at mptcp_finish_join time, move the new subflow
directly into the conn_list, otherwise place it on join_list and
let the release_cb process such list.
Since pending MPJ connection are now always processed
in a timely way, we can avoid flushing the join list
every time we have to process all the current subflows.
Additionally we can now use the mptcp data lock to protect
the join_list, removing the additional spin lock.
Finally, the MPJ handshake is now always finalized under the
msk socket lock, we can drop the additional synchronization
between mptcp_finish_join() and mptcp_close().
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:24 +0000 (16:20 -0800)]
selftests: mptcp: add tests for subflow creation failure
Verify that, when multiple endpoints are available, subflows
creation proceed even when the first additional subflow creation
fails - due to packet drop on the relevant link
Co-developed-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:23 +0000 (16:20 -0800)]
mptcp: do not block subflows creation on errors
If the MPTCP configuration allows for multiple subflows
creation, and the first additional subflows never reach
the fully established status - e.g. due to packets drop or
reset - the in kernel path manager do not move to the
next subflow.
This patch introduces a new PM helper to cope with MPJ
subflow creation failure and delay and hook it where appropriate.
Such helper triggers additional subflow creation, as needed
and updates the PM subflow counter, if the current one is
closing.
Additionally start all the needed additional subflows
as soon as the MPTCP socket is fully established, so we don't
have to cope with slow MPJ handshake blocking the next subflow
creation.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:22 +0000 (16:20 -0800)]
mptcp: keep track of local endpoint still available for each msk
Include into the path manager status a bitmap tracking the list
of local endpoints still available - not yet used - for the
relevant mptcp socket.
Keep such map updated at endpoint creation/deletion time, so
that we can easily skip already used endpoint at local address
selection time.
The endpoint used by the initial subflow is lazyly accounted at
subflow creation time: the usage bitmap is be up2date before
endpoint selection and we avoid such unneeded task in some relevant
scenarios - e.g. busy servers accepting incoming subflows but
not creating any additional ones nor annuncing additional addresses.
Overall this allows for fair local endpoints usage in case of
subflow failure.
As a side effect, this patch also enforces that each endpoint
is used at most once for each mptcp connection.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:21 +0000 (16:20 -0800)]
mptcp: clean-up MPJ option writing
Check for all MPJ variant at once, this reduces the number
of conditionals traversed on average and will simplify the
next patch.
No functional change intended.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:20 +0000 (16:20 -0800)]
mptcp: fix per socket endpoint accounting
Since full-mesh endpoint support, the reception of a single ADD_ADDR
option can cause multiple subflows creation. When such option is
accepted we increment 'add_addr_accepted' by one. When we received
a paired RM_ADDR option, we deleted all the relevant subflows,
decrementing 'add_addr_accepted' by one for each of them.
We have a similar issue for 'local_addr_used'
Fix them moving the pm endpoint accounting outside the subflow
traversal.
Fixes: eadd062c09df ("mptcp: local addresses fullmesh") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:19 +0000 (16:20 -0800)]
selftests: mptcp: add disconnect tests
Performs several disconnect/reconnect on the same socket,
ensuring the overall transfer is succesful.
The new test leverages ioctl(SIOCOUTQ) to ensure all the
pending data is acked before disconnecting.
Additionally order alphabetically the test program arguments list
for better maintainability.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:18 +0000 (16:20 -0800)]
mptcp: implement support for user-space disconnect
Handle explicitly AF_UNSPEC in mptcp_stream_connnect() to
allow user-space to disconnect established MPTCP connections
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:17 +0000 (16:20 -0800)]
mptcp: cleanup accept and poll
After the previous patch, msk->subflow will never be deleted during
the whole msk lifetime. We don't need anymore to acquire references to
it in mptcp_stream_accept() and we can use the listener subflow accept
queue to simplify mptcp_poll() for listener socket.
Overall this removes a lock pair and 4 more atomic operations per
accept().
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:16 +0000 (16:20 -0800)]
mptcp: full disconnect implementation
The current mptcp_disconnect() implementation lacks several
steps, we additionally need to reset the msk socket state
and flush the subflow list.
Factor out the needed helper to avoid code duplication.
Additionally ensure that the initial subflow is disposed
only after mptcp_close(), just reset it at disconnect time.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 7 Jan 2022 00:20:14 +0000 (16:20 -0800)]
mptcp: keep snd_una updated for fallback socket
After shutdown, for fallback MPTCP sockets, we always have
write_seq == snd_una+1
The above will foul OUTQ ioctl(). Keep snd_una in sync with
write_seq even after shutdown.
Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 7 Jan 2022 11:10:57 +0000 (11:10 +0000)]
Merge tag 'mlx5-updates-2022-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-01-06
1) Expose FEC per lane block counters via ethtool
2) Trivial fixes/updates/cleanup to mlx5e netdev driver
3) Fix htmldoc build warning
4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5
Shay Drory Says:
================
Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with
their peers subfunctions, causing them to use same CPU cores.
In large scale, this is very undesirable, SFs use small number of cpu
cores and all of them will be packed on the same CPU cores, not
utilizing all CPU cores in the system.
In this patchset we want to achieve two things.
a) Spread IRQs used by SFs to all cpu cores
b) Pack less SFs in the same IRQ, will result in multiple IRQs per core.
In this patchset, we spread SFs over all online cpus available to mlx5
irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next
CPU core with least number of SF IRQs bound to it, SFs will share IRQs on
the same core until a certain limit, when such limit is reached, we
request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs,
pick any IRQ with least number of SF users.
This enhancement is done in order to achieve a better distribution of
the SFs over all the available CPUs, which reduces application latency,
as shown bellow.
Machine details:
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores.
PCI Express 3 with BW of 126 Gb/s.
ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0
x16.
Base line test description:
Single SF on the system. One instance of netperf is running on-top the
SF.
Numbers: latency = 15.136 usec, CPU Util = 35%
Test description:
There are 250 SFs on the system. There are 3 instances of netperf
running, on-top three different SFs, in parallel.
- The first two entries (#1 and #2) show current state. e.g.: SFs are
using the same CPU. The last two entries (#3 and #4) shows the latency
reduction improvement of this patch. e.g.: SFs are on different CPUs.
- Whenever we use several CPUs, in case there is a different CPU
utilization, write the utilization of each CPU separately.
- Whenever the latency result of the netperf instances were different,
write the latency of each netperf instances separately.
Commands:
- for netperf CPU=0:
$ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR -- \
-o RT_LATENCY -r8 & done
- for netperf CPU=0,2,4
$ for i in {1..3}; do taskset -c $(( ($i - 1) * 2 )) netperf -H \
1${i}.1.1.1 -t TCP_RR -- -o RT_LATENCY -r8 & done
================
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jakub Kicinski [Fri, 7 Jan 2022 04:06:32 +0000 (20:06 -0800)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
100GbE Intel Wired LAN Driver Updates 2022-01-06
Victor adds restoring of advanced rules after reset.
Wojciech improves usage of switchdev control VSI by utilizing the
device's advanced rules for forwarding.
Christophe Jaillet removes some unneeded calls to zero bitmaps, changes
some bitmap operations that don't need to be atomic, and converts a
kfree() to a more appropriate bitmap_free().
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Use bitmap_free() to free bitmap
ice: Optimize a few bitmap operations
ice: Slightly simply ice_find_free_recp_res_idx
ice: improve switchdev's slow-path
ice: replay advanced rules after reset
====================
Jakub Kicinski [Fri, 7 Jan 2022 04:00:48 +0000 (20:00 -0800)]
Merge branch 'mlxsw-add-spectrum-4-support'
Ido Schimmel says:
====================
mlxsw: Add Spectrum-4 support
This patchset adds Spectrum-4 support in mlxsw. It builds on top of a
previous patchset merged in commit 7f7afa9679d4 ("Merge branch
'mlxsw-Spectrum-4-prep'") and makes two additional changes before adding
Spectrum-4 support.
Patchset overview:
Patches #1-#2 add a few Spectrum-4 specific variants of existing ACL
keys. The new variants are needed because the size of certain key
elements (e.g., local port) was increased in Spectrum-4.
Patches #3-#6 are preparations.
Patch #7 implements the Spectrum-4 variant of the Bloom filter hash
function. The Bloom filter is used to optimize ACL lookups by
potentially skipping certain lookups if they are guaranteed not to
match. See additional info in merge commit 71d1ced5466d ("Merge branch
'mlxsw-spectrum_acl-Add-Bloom-filter-support'").
Amit Cohen [Thu, 6 Jan 2022 16:06:51 +0000 (18:06 +0200)]
mlxsw: spectrum_acl_bloom_filter: Add support for Spectrum-4 calculation
Spectrum-4 will calculate hash function for bloom filter differently
from the existing ASICs.
First, two hash functions will be used to calculate 16 bits result.
The final result will be combination of the two results - 6 bits which
are result of CRC-6 will be used as MSB and 10 bits which are result of
CRC-10 will be used as LSB.
Second, while in Spectrum{2,3}, there is a padding in each chunk, so the
chunks use a sequence of whole bytes, in Spectrum-4 there is no padding,
so each chunk use 20 bytes minus 2 bits, so it is necessary to align the
chunks to be without holes.
Add dedicated 'mlxsw_sp_acl_bf_ops' for Spectrum-4 and add the required
tables for CRC calculations.
All the details are documented as part of the code for future use.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Thu, 6 Jan 2022 16:06:50 +0000 (18:06 +0200)]
mlxsw: Add operations structure for bloom filter calculation
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.
As preparation for support of Spectrum-4 bloom filter, add 'ops'
structure to allow handling different calculation for different ASICs.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Thu, 6 Jan 2022 16:06:49 +0000 (18:06 +0200)]
mlxsw: spectrum_acl_bloom_filter: Rename Spectrum-2 specific objects for future use
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.
As preparation for support of Spectrum-4 bloom filter, rename CRC table
to include "sp2" prefix and "crc16", as next patch will add two additional
tables. In addition, rename all the dedicated functions and defines for
Spectrum-{2,3} to include "sp2" prefix.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Thu, 6 Jan 2022 16:06:48 +0000 (18:06 +0200)]
mlxsw: spectrum_acl_bloom_filter: Make mlxsw_sp_acl_bf_key_encode() more flexible
Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.
One of the changes is related to the way that the chunks will be build -
without padding.
As preparation for support of Spectrum-4 bloom filter, make
mlxsw_sp_acl_bf_key_encode() more flexible, so it will be able to use it
for Spectrum-4 as well.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Thu, 6 Jan 2022 16:06:46 +0000 (18:06 +0200)]
mlxsw: Introduce flex key elements for Spectrum-4
Spectrum-4 ASIC will support more virtual routers and local ports
compared to the existing ASICs. Therefore, the virtual router and local
port ACL key elements need to be increased.
Introduce new key elements for Spectrum-4 to be aligned with the elements
used already for other Spectrum ASICs.
The key blocks layout is the same for Spectrum-4, so use the existing
code for encode_block() and clear_block(), just create separate blocks.
Note that size of `VIRT_ROUTER_MSB` is 4 bits in Spectrum-4,
therefore declare it using `MLXSW_AFK_ELEMENT_INST_U32()`, in order to
be able to set `.avoid_size_check` to true.
Otherwise, `mlxsw_afk_blocks_check()` will fail and warn.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Amit Cohen [Thu, 6 Jan 2022 16:06:45 +0000 (18:06 +0200)]
mlxsw: Rename virtual router flex key element
In Spectrum-4, the size of the virtual router ACL key element increased
from 11 bits to 12 bits.
In order to reuse the existing virtual router ACL key element
enumerators for Spectrum-4, rename 'VIRT_ROUTER_8_10' and
'VIRT_ROUTER_0_7' to 'VIRT_ROUTER_MSB' and 'VIRT_ROUTER_LSB',
respectively.
No functional changes intended.
Signed-off-by: Amit Cohen <amcohen@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 7 Jan 2022 03:56:31 +0000 (19:56 -0800)]
Merge branch 'dpaa2-eth-small-cleanup'
Ioana Ciornei says:
====================
dpaa2-eth: small cleanup
These 3 patches are just part of a small cleanup on the dpaa2-eth and
the dpaa2-switch drivers.
In case we are hitting a case in which the fwnode of the root dprc
device we initiate a deferred probe. On the dpaa2-switch side, if we are
on the remove path, make sure that we check for a non-NULL pointer
before accessing the port private structure.
====================
Ioana Ciornei [Thu, 6 Jan 2022 13:59:05 +0000 (15:59 +0200)]
dpaa2-switch: check if the port priv is valid
Before accessing the port private structure make sure that there is
still a non-NULL pointer there. A NULL pointer access can happen when we
are on the remove path, some switch ports are unregistered and some are
in the process of unregistering.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ioana Ciornei [Thu, 6 Jan 2022 13:59:04 +0000 (15:59 +0200)]
dpaa2-mac: return -EPROBE_DEFER from dpaa2_mac_open in case the fwnode is not set
We could get into a situation when the fwnode of the parent device is
not yet set because its probe didn't yet finish. When this happens, any
caller of the dpaa2_mac_open() will not have the fwnode available, thus
cause problems at the PHY connect time.
Avoid this by just returning -EPROBE_DEFER from the dpaa2_mac_open when
this happens.
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The parent pointer node handler must be declared with a NULL
initializer. Before using it, a check must be performed to make
sure that a valid address has been assigned to it.
Signed-off-by: Robert-Ionut Alexa <robert-ionut.alexa@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We've added 41 non-merge commits during the last 2 day(s) which contain
a total of 36 files changed, 1214 insertions(+), 368 deletions(-).
The main changes are:
1) Various fixes in the verifier, from Kris and Daniel.
2) Fixes in sockmap, from John.
3) bpf_getsockopt fix, from Kuniyuki.
4) INET_POST_BIND fix, from Menglong.
5) arm64 JIT fix for bpf pseudo funcs, from Hou.
6) BPF ISA doc improvements, from Christoph.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (41 commits)
bpf: selftests: Add bind retry for post_bind{4, 6}
bpf: selftests: Use C99 initializers in test_sock.c
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
bpf/selftests: Test bpf_d_path on rdonly_mem.
libbpf: Add documentation for bpf_map batch operations
selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3
xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames
xdp: Move conversion to xdp_frame out of map functions
page_pool: Store the XDP mem id
page_pool: Add callback to init pages when they are allocated
xdp: Allow registering memory model without rxq reference
samples/bpf: xdpsock: Add timestamp for Tx-only operation
samples/bpf: xdpsock: Add time-out for cleaning Tx
samples/bpf: xdpsock: Add sched policy and priority support
samples/bpf: xdpsock: Add cyclic TX operation capability
samples/bpf: xdpsock: Add clockid selection support
samples/bpf: xdpsock: Add Dest and Src MAC setting for Tx-only operation
samples/bpf: xdpsock: Add VLAN support for Tx-only operation
libbpf 1.0: Deprecate bpf_object__find_map_by_offset() API
libbpf 1.0: Deprecate bpf_map__is_offload_neutral()
...
====================
Merge branch 'net: bpf: handle return value of post_bind{4,6} and add selftests for it'
Menglong Dong says:
====================
From: Menglong Dong <imagedong@tencent.com>
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
The second patch use C99 initializers in test_sock.c
The third patch is the selftests for this modification.
Changes since v4:
- use C99 initializers in test_sock.c before adding the test case
Changes since v3:
- add the third patch which use C99 initializers in test_sock.c
Changes since v2:
- NULL check for sk->sk_prot->put_port
Changes since v1:
- introduce 'put_port' field for 'struct proto'
- add selftests for it
====================
Menglong Dong [Thu, 6 Jan 2022 13:20:22 +0000 (21:20 +0800)]
bpf: selftests: Add bind retry for post_bind{4, 6}
With previous patch, kernel is able to 'put_port' after sys_bind()
fails. Add the test for that case: rebind another port after
sys_bind() fails. If the bind success, it means previous bind
operation is already undoed.
Menglong Dong [Thu, 6 Jan 2022 13:20:20 +0000 (21:20 +0800)]
net: bpf: Handle return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND()
The return value of BPF_CGROUP_RUN_PROG_INET{4,6}_POST_BIND() in
__inet_bind() is not handled properly. While the return value
is non-zero, it will set inet_saddr and inet_rcv_saddr to 0 and
exit:
Let's take UDP for example and see what will happen. For UDP
socket, it will be added to 'udp_prot.h.udp_table->hash' and
'udp_prot.h.udp_table->hash2' after the sk->sk_prot->get_port()
called success. If 'inet->inet_rcv_saddr' is specified here,
then 'sk' will be in the 'hslot2' of 'hash2' that it don't belong
to (because inet_saddr is changed to 0), and UDP packet received
will not be passed to this sock. If 'inet->inet_rcv_saddr' is not
specified here, the sock will work fine, as it can receive packet
properly, which is wired, as the 'bind()' is already failed.
To undo the get_port() operation, introduce the 'put_port' field
for 'struct proto'. For TCP proto, it is inet_put_port(); For UDP
proto, it is udp_lib_unhash(); For icmp proto, it is
ping_unhash().
Therefore, after sys_bind() fail caused by
BPF_CGROUP_RUN_PROG_INET4_POST_BIND(), it will be unbinded, which
means that it can try to be binded to another port.
Documentation/networking/devlink/mlx5.rst:13: WARNING: Error parsing content block for the "list-table" directive:
+uniform two-level bullet list expected, but row 2 does not contain the same number of items as row 1 (2 vs 3).
...
Add the missing item in the first row.
Fixes: e87cbceaa6ae ("net/mlx5: Let user configure io_eq_size param") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Gal Pressman [Wed, 22 Dec 2021 12:03:39 +0000 (14:03 +0200)]
net/mlx5e: Add recovery flow in case of error CQE
The rep legacy RQ completion handling was missing the appropriate
handling of error CQEs (dump the CQE and queue a recover work), fix it
by calling trigger_report() when needed.
Since all CQE handling flows do the exact same error CQE handling,
extract it to a common helper function.
Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Roi Dayan [Mon, 3 Jan 2022 08:57:01 +0000 (10:57 +0200)]
net/mlx5e: TC, Remove redundant error logging
Remove redundant and trivial error logging when trying to
offload mirred device with unsupported devices.
Using OVS could hit those a lot and the errors are still
logged in extack.
Gal Pressman [Mon, 29 Nov 2021 08:57:31 +0000 (10:57 +0200)]
net/mlx5e: Move HW-GRO and CQE compression check to fix features flow
Feature dependencies should be resolved in fix features rather than in
set features flow. Move the check that disables HW-GRO in case CQE
compression is enabled from set_feature_hw_gro() to
mlx5e_fix_features().
Signed-off-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maor Dickman [Thu, 9 Dec 2021 12:03:01 +0000 (14:03 +0200)]
net/mlx5e: Unblock setting vid 0 for VF in case PF isn't eswitch manager
When using libvirt to passthrough VF to VM it will always set the VF vlan
to 0 even if user didn’t request it, this will cause libvirt to fail to
boot in case the PF isn't eswitch owner.
Example of such case is the DPU host PF which isn't eswitch manager, so
any attempt to passthrough VF of it using libvirt will fail.
Fix it by not returning error in case set VF vlan is called with vid 0.
Lama Kayal [Mon, 13 Sep 2021 13:06:35 +0000 (16:06 +0300)]
net/mlx5e: Expose FEC counters via ethtool
Add FEC counters' statistics of corrected_blocks and
uncorrectable_blocks, along with their lanes via ethtool.
HW supports corrected_blocks and uncorrectable_blocks counters both for
RS-FEC mode and FC-FEC mode. In FC mode these counters are accumulated
per lane, while in RS mode the correction method crosses lanes, thus
only total corrected_blocks and uncorrectable_blocks are reported in
this mode.
Signed-off-by: Lama Kayal <lkayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maher Sanalla [Wed, 5 Jan 2022 12:50:11 +0000 (14:50 +0200)]
net/mlx5: Update log_max_qp value to FW max capability
log_max_qp in driver's default profile #2 was set to 18, but FW actually
supports 17 at the most - a situation that led to the concerning print
when the driver is loaded:
"log_max_qp value in current profile is 18, changing to HCA capabaility
limit (17)"
The expected behavior from mlx5_profile #2 is to match the maximum FW
capability in regards to log_max_qp. Thus, log_max_qp in profile #2 is
initialized to a defined static value (0xff) - which basically means that
when loading this profile, log_max_qp value will be what the currently
installed FW supports at most.
Shay Drory [Tue, 23 Nov 2021 10:50:19 +0000 (12:50 +0200)]
net/mlx5: SF, Use all available cpu for setting cpu affinity
Currently all SFs are using the same CPUs. Spreading SF over CPUs, in
round-robin manner, in order to achieve better distribution of the SFs
over available CPUs.
Shay Drory [Sun, 12 Dec 2021 12:51:27 +0000 (14:51 +0200)]
net/mlx5: Introduce API for bulk request and release of IRQs
Currently IRQs are requested one by one. To balance spreading IRQs
among cpus using such scheme requires remembering cpu mask for the
cpus used for a given device. This complicates the IRQ allocation
scheme in subsequent patch.
Hence, prepare the code for bulk IRQs allocation. This enables
spreading IRQs among cpus in subsequent patch.
Shay Drory [Tue, 23 Nov 2021 08:48:07 +0000 (10:48 +0200)]
net/mlx5: Split irq_pool_affinity logic to new file
The downstream patches add more functionality to irq_pool_affinity.
Move the irq_pool_affinity logic to a new file in order to ease the
coding and maintenance of it.
Shay Drory [Sun, 14 Nov 2021 11:01:21 +0000 (13:01 +0200)]
net/mlx5: Introduce control IRQ request API
Currently, IRQ layer have a separate flow for ctrl and comp IRQs, and
the distinction between ctrl and comp IRQs is done in the IRQ layer.
In order to ease the coding and maintenance of the IRQ layer,
introduce a new API for requesting control IRQs -
mlx5_ctrl_irq_request(struct mlx5_core_dev *dev).
Hao Luo [Thu, 6 Jan 2022 20:55:25 +0000 (12:55 -0800)]
bpf/selftests: Test bpf_d_path on rdonly_mem.
The second parameter of bpf_d_path() can only accept writable
memories. Rdonly_mem obtained from bpf_per_cpu_ptr() can not
be passed into bpf_d_path for modification. This patch adds
a selftest to verify this behavior.
This also updates the public API for the `keys` parameter
of `bpf_map_delete_batch()`, and both the
`keys` and `values` parameters of `bpf_map_update_batch()`
to be constants.
Andrii Nakryiko [Thu, 6 Jan 2022 20:51:56 +0000 (12:51 -0800)]
selftests/bpf: Don't rely on preserving volatile in PT_REGS macros in loop3
PT_REGS*() macro on some architectures force-cast struct pt_regs to
other types (user_pt_regs, etc) and might drop volatile modifiers, if any.
Volatile isn't really required as pt_regs value isn't supposed to change
during the BPF program run, so this is correct behavior.
But progs/loop3.c relies on that volatile modifier to ensure that loop
is preserved. Fix loop3.c by declaring i and sum variables as volatile
instead. It preserves the loop and makes the test pass on all
architectures (including s390x which is currently broken).
The 'possible_idx' bitmap is set just after it is zeroed, so we can save
the first step.
The 'free_idx' bitmap is used only at the end of the function as the
result of a bitmap xor operation. So there is no need to explicitly
zero it before.
So, slightly simply the code and remove 2 useless 'bitmap_zero()' call
Wojciech Drewek [Tue, 26 Oct 2021 10:38:40 +0000 (12:38 +0200)]
ice: improve switchdev's slow-path
In current switchdev implementation, every VF PR is assigned to
individual ring on switchdev ctrl VSI. For slow-path traffic, there
is a mapping VF->ring done in software based on src_vsi value (by
calling ice_eswitch_get_target_netdev function).
With this change, HW solution is introduced which is more
efficient. For each VF, src MAC (VF's MAC) filter will be created,
which forwards packets to the corresponding switchdev ctrl VSI queue
based on src MAC address.
This filter has to be removed and then replayed in case of
resetting one VF. Keep information about this rule in repr->mac_rule,
thanks to that we know which rule has to be removed and replayed
for a given VF.
In case of CORE/GLOBAL all rules are removed
automatically. We have to take care of readding them. This is done
by ice_replay_vsi_adv_rule.
When driver leaves switchdev mode, remove all advanced rules
from switchdev ctrl VSI. This is done by ice_rem_adv_rule_for_vsi.
Flag repr->rule_added is needed because in some cases reset
might be triggered before VF sends request to add MAC.
Co-developed-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Grzegorz Nitka <grzegorz.nitka@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Victor Raj [Tue, 26 Oct 2021 10:38:39 +0000 (12:38 +0200)]
ice: replay advanced rules after reset
ice_replay_vsi_adv_rule will replay advanced rules for a given VSI.
Exit this function when list of rules for given recipe is empty.
Do not add rule when given vsi_handle does not match vsi_handle
from the rule info.
Use ICE_MAX_NUM_RECIPES instead of ICE_SW_LKUP_LAST in order to find
advanced rules as well.
Signed-off-by: Victor Raj <victor.raj@intel.com> Signed-off-by: Wojciech Drewek <wojciech.drewek@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Daniel Borkmann [Thu, 6 Jan 2022 00:46:06 +0000 (01:46 +0100)]
veth: Do not record rx queue hint in veth_xmit
Laurent reported that they have seen a significant amount of TCP retransmissions
at high throughput from applications residing in network namespaces talking to
the outside world via veths. The drops were seen on the qdisc layer (fq_codel,
as per systemd default) of the phys device such as ena or virtio_net due to all
traffic hitting a _single_ TX queue _despite_ multi-queue device. (Note that the
setup was _not_ using XDP on veths as the issue is generic.)
More specifically, after ce8d22f4bd21 ("veth: Store queue_mapping independently
of XDP prog presence") which made it all the way back to v4.19.184+,
skb_record_rx_queue() would set skb->queue_mapping to 1 (given 1 RX and 1 TX
queue by default for veths) instead of leaving at 0.
This is eventually retained and callbacks like ena_select_queue() will also pick
single queue via netdev_core_pick_tx()'s ndo_select_queue() once all the traffic
is forwarded to that device via upper stack or other means. Similarly, for others
not implementing ndo_select_queue() if XPS is disabled, netdev_pick_tx() might
call into the skb_tx_hash() and check for prior skb_rx_queue_recorded() as well.
In general, it is a _bad_ idea for virtual devices like veth to mess around with
queue selection [by default]. Given dev->real_num_tx_queues is by default 1,
the skb->queue_mapping was left untouched, and so prior to ce8d22f4bd21 the
netdev_core_pick_tx() could do its job upon __dev_queue_xmit() on the phys device.
Unbreak this and restore prior behavior by removing the skb_record_rx_queue()
from veth_xmit() altogether.
If the veth peer has an XDP program attached, then it would return the first RX
queue index in xdp_md->rx_queue_index (unless configured in non-default manner).
However, this is still better than breaking the generic case.
Fixes: ce8d22f4bd21 ("veth: Store queue_mapping independently of XDP prog presence") Fixes: 726112bb1385 ("veth: Support per queue XDP ring") Reported-by: Laurent Bernaille <laurent.bernaille@datadoghq.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Cc: Toshiaki Makita <toshiaki.makita1@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Paolo Abeni <pabeni@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ethernet: ibmveth: use default_groups in kobj_type
There are currently 2 ways to create a set of sysfs files for a
kobj_type, through the default_attrs field, and the default_groups
field. Move the ibmveth sysfs code to use default_groups
field which has been the preferred way since 69dee9413867 ("kobject: Add
support for default attribute groups to kobj_type") so that we can soon
get rid of the obsolete default_attrs field.
Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Cristobal Forno <cforno12@linux.ibm.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: linuxppc-dev@lists.ozlabs.org Cc: netdev@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jiapeng Chong [Wed, 5 Jan 2022 15:22:37 +0000 (23:22 +0800)]
sfc: Use swap() instead of open coding it
Clean the following coccicheck warning:
./drivers/net/ethernet/sfc/efx_channels.c:870:36-37: WARNING opportunity
for swap().
./drivers/net/ethernet/sfc/efx_channels.c:824:36-37: WARNING opportunity
for swap().
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert the PCS selection to use mac_select_pcs, which allows the PCS
to perform any validation it needs.
We must use separate phylink_pcs instances for the USX and SGMII PCS,
rather than just changing the "ops" pointer before re-setting it to
phylink as this interface queries the PCS, rather than requesting it
to be changed.
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Coco Li [Wed, 5 Jan 2022 10:48:38 +0000 (02:48 -0800)]
gro: add ability to control gro max packet size
Eric Dumazet suggested to allow users to modify max GRO packet size.
We have seen GRO being disabled by users of appliances (such as
wifi access points) because of claimed bufferbloat issues,
or some work arounds in sch_cake, to split GRO/GSO packets.
Instead of disabling GRO completely, one can chose to limit
the maximum packet size of GRO packets, depending on their
latency constraints.
This patch adds a per device gro_max_size attribute
that can be changed with ip link command.
ip link set dev eth0 gro_max_size 16000
Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Coco Li <lixiaoyan@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: fix SOF_TIMESTAMPING_BIND_PHC to work with multiple sockets
When multiple sockets using the SOF_TIMESTAMPING_BIND_PHC flag received
a packet with a hardware timestamp (e.g. multiple PTP instances in
different PTP domains using the UDPv4/v6 multicast or L2 transport),
the timestamps received on some sockets were corrupted due to repeated
conversion of the same timestamp (by the same or different vclocks).
Fix ptp_convert_timestamp() to not modify the shared skb timestamp
and return the converted timestamp as a ktime_t instead. If the
conversion fails, return 0 to not confuse the application with
timestamps corresponding to an unexpected PHC.
Fixes: fe495d8097cd ("net: socket: support hardware timestamp conversion to PHC bound") Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com> Cc: Yangbo Lu <yangbo.lu@nxp.com> Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 22:11:50 +0000 (00:11 +0200)]
net: dsa: warn about dsa_port and dsa_switch bit fields being non atomic
As discussed during review here:
https://patchwork.kernel.org/project/netdevbpf/patch/20220105132141.2648876-3-vladimir.oltean@nxp.com/
we should inform developers about pitfalls of concurrent access to the
boolean properties of dsa_switch and dsa_port, now that they've been
converted to bit fields. No other measure than a comment needs to be
taken, since the code paths that update these bit fields are not
concurrent with each other.
Suggested-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 22:11:49 +0000 (00:11 +0200)]
net: dsa: don't enumerate dsa_switch and dsa_port bit fields using commas
This is a cosmetic incremental fixup to commits 8c176ce90d1a ("net: dsa: merge all bools of struct dsa_switch into a single u32") c7bec1ef6a8e ("net: dsa: merge all bools of struct dsa_port into a single u8")
The desire to make this change was enunciated after posting these
patches here:
https://patchwork.kernel.org/project/netdevbpf/cover/20220105132141.2648876-1-vladimir.oltean@nxp.com/
but due to a slight timing overlap (message posted at 2:28 p.m. UTC,
merge commit is at 2:46 p.m. UTC), that comment was missed and the
changes were applied as-is.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 6 Jan 2022 11:59:10 +0000 (11:59 +0000)]
Merge branch 'dsa-init-cleanups'
Vladimir Oltean says:
====================
DSA initialization cleanups
These patches contain miscellaneous work that makes the DSA init code
path symmetric with the teardown path, and some additional patches
carried by Ansuel Smith for his register access over Ethernet work, but
those patches can be applied as-is too.
https://patchwork.kernel.org/project/netdevbpf/patch/20211214224409.5770-3-ansuelsmth@gmail.com/
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 23:11:17 +0000 (01:11 +0200)]
net: dsa: setup master before ports
It is said that as soon as a network interface is registered, all its
resources should have already been prepared, so that it is available for
sending and receiving traffic. One of the resources needed by a DSA
slave interface is the master.
dsa_tree_setup
-> dsa_tree_setup_ports
-> dsa_port_setup
-> dsa_slave_create
-> register_netdevice
-> dsa_tree_setup_master
-> dsa_master_setup
-> sets up master->dsa_ptr, which enables reception
Therefore, there is a short period of time after register_netdevice()
during which the master isn't prepared to pass traffic to the DSA layer
(master->dsa_ptr is checked by eth_type_trans). Same thing during
unregistration, there is a time frame in which packets might be missed.
Note that this change opens us to another race: dsa_master_find_slave()
will get invoked potentially earlier than the slave creation, and later
than the slave deletion. Since dp->slave starts off as a NULL pointer,
the earlier calls aren't a problem, but the later calls are. To avoid
use-after-free, we should zeroize dp->slave before calling
dsa_slave_destroy().
In practice I cannot really test real life improvements brought by this
change, since in my systems, netdevice creation races with PHY autoneg
which takes a few seconds to complete, and that masks quite a few races.
Effects might be noticeable in a setup with fixed links all the way to
an external system.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 23:11:16 +0000 (01:11 +0200)]
net: dsa: first set up shared ports, then non-shared ports
After commit 654b5502be55 ("net: dsa: flush switchdev workqueue before
tearing down CPU/DSA ports"), the port setup and teardown procedure
became asymmetric.
The fact of the matter is that user ports need the shared ports to be up
before they can be used for CPU-initiated termination. And since we
register net devices for the user ports, those won't be functional until
we also call the setup for the shared (CPU, DSA) ports. But we may do
that later, depending on the port numbering scheme of the hardware we
are dealing with.
It just makes sense that all shared ports are brought up before any user
port is. I can't pinpoint any issue due to the current behavior, but
let's change it nonetheless, for consistency's sake.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 23:11:15 +0000 (01:11 +0200)]
net: dsa: hold rtnl_mutex when calling dsa_master_{setup,teardown}
DSA needs to simulate master tracking events when a binding is first
with a DSA master established and torn down, in order to give drivers
the simplifying guarantee that ->master_state_change calls are made
only when the master's readiness state to pass traffic changes.
master_state_change() provide a operational bool that DSA driver can use
to understand if DSA master is operational or not.
To avoid races, we need to block the reception of
NETDEV_UP/NETDEV_CHANGE/NETDEV_GOING_DOWN events in the netdev notifier
chain while we are changing the master's dev->dsa_ptr (this changes what
netdev_uses_dsa(dev) reports).
The dsa_master_setup() and dsa_master_teardown() functions optionally
require the rtnl_mutex to be held, if the tagger needs the master to be
promiscuous, these functions call dev_set_promiscuity(). Move the
rtnl_lock() from that function and make it top-level.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
So the dev_set_mtu() call from dsa_master_setup() has been effectively
superseded by the dsa_slave_change_mtu(slave_dev, ETH_DATA_LEN) that is
done from dsa_slave_create() for each user port. The later function also
updates the master MTU according to the largest user port MTU from the
tree. Therefore, updating the master MTU through a separate code path
isn't needed.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 23:11:13 +0000 (01:11 +0200)]
net: dsa: merge rtnl_lock sections in dsa_slave_create
Currently dsa_slave_create() has two sequences of rtnl_lock/rtnl_unlock
in a row. Remove the rtnl_unlock() and rtnl_lock() in between, such that
the operation can execute slighly faster.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 5 Jan 2022 23:11:12 +0000 (01:11 +0200)]
net: dsa: reorder PHY initialization with MTU setup in slave.c
In dsa_slave_create() there are 2 sections that take rtnl_lock():
MTU change and netdev registration. They are separated by PHY
initialization.
There isn't any strict ordering requirement except for the fact that
netdev registration should be last. Therefore, we can perform the MTU
change a bit later, after the PHY setup. A future change will then be
able to merge the two rtnl_lock sections into one.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
xdp: Add xdp_do_redirect_frame() for pre-computed xdp_frames
Add an xdp_do_redirect_frame() variant which supports pre-computed
xdp_frame structures. This will be used in bpf_prog_run() to avoid having
to write to the xdp_frame structure when the XDP program doesn't modify the
frame boundaries.
xdp: Move conversion to xdp_frame out of map functions
All map redirect functions except XSK maps convert xdp_buff to xdp_frame
before enqueueing it. So move this conversion of out the map functions
and into xdp_do_redirect(). This removes a bit of duplicated code, but more
importantly it makes it possible to support caller-allocated xdp_frame
structures, which will be added in a subsequent commit.
page_pool: Add callback to init pages when they are allocated
Add a new callback function to page_pool that, if set, will be called every
time a new page is allocated. This will be used from bpf_test_run() to
initialise the page data with the data provided by userspace when running
XDP programs with redirect turned on.
xdp: Allow registering memory model without rxq reference
The functions that register an XDP memory model take a struct xdp_rxq as
parameter, but the RXQ is not actually used for anything other than pulling
out the struct xdp_mem_info that it embeds. So refactor the register
functions and export variants that just take a pointer to the xdp_mem_info.
This is in preparation for enabling XDP_REDIRECT in bpf_prog_run(), using a
page_pool instance that is not connected to any network device.
First of all, sorry for taking more time to get back to this series and
thanks to all valuble feedback in series-1 at [1] from Jesper and Song
Liu.
Since then I have looked into what Jesper suggested in [2] and worked on
revising the patch series into several patches for ease of review:
v1->v2:
1/7: [No change]. Add VLAN tag (ID & Priority) to the generated Tx-Only
frames.
2/7: [No change]. Add DMAC and SMAC setting to the generated Tx-Only
frames. If parameters are not set, previous DMAC and SMAC are used.
3/7: [New]. Add support for selecting different CLOCK for clock_gettime()
used in get_nsecs.
4/7: [New]. This is a total rework from series-1 3/4-patch [3]. It uses
clock_nanosleep() suggested by Jesper. In addition, added statistic
for Tx schedule variance under application stat (-a|--app-stats).
Make the cyclic Tx operation and --poll mode to be mutually-
exclusive. Still, the ability to specify TX cycle time and used
together with batch size and packet count remain the same.
5/7: [New]. Add the support for TX process schedule policy and priority
setting. By default, SCHED_OTHER policy is used. This too is matching
the schedule policy setting in [2].
6/7: [Change]. This is update from series-1 4/4-patch [4]. Added TX clean
process time-out in 1s granularity with configurable retries count
(-O|--retries).
7/7: [New]. Added timestamp for TX packet following pktgen_hdr format
matching the implementation in [2]. However, the sequence ID remains
the same as it is instead of process schedule diff in [2].
To summarize on what program options have been added with v2 series
using an example below:-
I have tested the above with both tagged and untagged packet format and
based on the timestamp in tcpdump found that the timing of the batch
cyclic transmission is correct.
Appreciate if community can give the patch series v2 a try and point out
any gap.
Ong Boon Leong [Thu, 30 Dec 2021 03:54:47 +0000 (11:54 +0800)]
samples/bpf: xdpsock: Add timestamp for Tx-only operation
It may be useful to add timestamp for Tx packets for continuous or cyclic
transmit operation. The timestamp and sequence ID of a Tx packet are
stored according to pktgen header format. To enable per-packet timestamp,
use -y|--tstamp option. If timestamp is off, pktgen header is not
included in the UDP payload. This means receiving side can use the magic
number for pktgen for differentiation.
The implementation supports both VLAN tagged and untagged option. By
default, the minimum packet size is set at 64B. However, if VLAN tagged
is on (-V), the minimum packet size is increased to 66B just so to fit
the pktgen_hdr size.
Added hex_dump() into the code path just for future cross-checking.
As before, simply change to "#define DEBUG_HEXDUMP 1" to inspect the
accuracy of TX packet.
Ong Boon Leong [Thu, 30 Dec 2021 03:54:46 +0000 (11:54 +0800)]
samples/bpf: xdpsock: Add time-out for cleaning Tx
When user sets tx-pkt-count and in case where there are invalid Tx frame,
the complete_tx_only_all() process polls indefinitely. So, this patch
adds a time-out mechanism into the process so that the application
can terminate automatically after it retries 3*polling interval duration.
v1->v2:
Thanks to Jesper's and Song Liu's suggestion.
- clean-up git message to remove polling log
- make the Tx time-out retries configurable with 1s granularity
Ong Boon Leong [Thu, 30 Dec 2021 03:54:45 +0000 (11:54 +0800)]
samples/bpf: xdpsock: Add sched policy and priority support
By default, TX schedule policy is SCHED_OTHER (round-robin time-sharing).
To improve TX cyclic scheduling, we add SCHED_FIFO policy and its priority
by using -W FIFO or --policy=FIFO and -U <PRIO> or --schpri=<PRIO>.
A) From xdpsock --app-stats, for SCHED_OTHER policy:
$ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a
period min ave max cycle
Cyclic TX 1000000 53507 75334 712642 6250
B) For SCHED_FIFO policy and schpri=50:
$ xdpsock -i eth0 -t -N -z -T 1000 -b 16 -C 100000 -a -W FIFO -U 50
period min ave max cycle
Cyclic TX 1000000 3699 24859 54397 6250