This random series addresses some of the suboptimal constructions used
in the main GRO entry point.
The main body is gro_list_prepare() simplification and pointer usage
optimization in dev_gro_receive() itself. Being mostly cosmetic, it
gives about +10 Mbps on my setup for both TCP and UDP (both single- and
multi-flow).
Since v1 [0]:
- drop the replacement of bucket index calculation with
reciprocal_scale() since it makes absolutely no sense (Eric);
- improve stack usage in dev_gro_receive() (Eric);
- reverse the order of patches to avoid changes superseding.
gro: give 'hash' variable in dev_gro_receive() a less confusing name
'hash' stores not the flow hash, but the index of the GRO bucket
corresponding to it.
Change its name to 'bucket' to avoid confusion while reading lines
like '__set_bit(hash, &napi->gro_bitmask)'.
Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
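The distinction between the flow hash and the bucket index can be sketched in a few lines of Python. This is an illustration only, not the kernel code: the bucket count of 8 and the helper names gro_bucket() and mark_bucket_busy() are assumptions made for the example.

```python
GRO_HASH_BUCKETS = 8  # assumed bucket count; a power of two so masking works


def gro_bucket(flow_hash: int) -> int:
    """Map a 32-bit flow hash to a GRO bucket index (hypothetical helper)."""
    return flow_hash & (GRO_HASH_BUCKETS - 1)


def mark_bucket_busy(bitmask: int, bucket: int) -> int:
    """Python stand-in for __set_bit(bucket, &napi->gro_bitmask)."""
    return bitmask | (1 << bucket)


flow_hash = 0xDEADBEEF
bucket = gro_bucket(flow_hash)             # an index in 0..7, not the hash
gro_bitmask = mark_bucket_busy(0, bucket)  # one bit per non-empty bucket
```

Read this way, passing a variable called 'hash' to __set_bit() is misleading, because the bit position is the small bucket index rather than the 32-bit hash.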
gro: consistentify napi->gro_hash[x] access in dev_gro_receive()
GRO bucket index doesn't change through the entire function.
Store a pointer to the corresponding bucket instead of its member
and use it consistently through the function.
It is performance-safe since &gro_list->list == gro_list.
Misc: remove superfluous braces around single-line branches.
Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
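The claim that &gro_list->list == gro_list holds whenever the list head is the first member of the bucket structure. A rough way to check that property from user space is to mirror the layout with ctypes. The layout below (a list head followed by a counter) is an assumption for illustration; ListHead and GroList are stand-in names.

```python
import ctypes


class ListHead(ctypes.Structure):
    """Stand-in for the kernel's struct list_head."""


# Self-referential pointers must be assigned after the class exists.
ListHead._fields_ = [
    ("next", ctypes.POINTER(ListHead)),
    ("prev", ctypes.POINTER(ListHead)),
]


class GroList(ctypes.Structure):
    """Assumed bucket layout: the list head first, then a counter."""
    _fields_ = [("list", ListHead), ("count", ctypes.c_int)]


gro_list = GroList()
# The first member sits at offset 0, so both addresses are identical.
same_address = ctypes.addressof(gro_list) == ctypes.addressof(gro_list.list)
```

Because the offset is zero, keeping a pointer to the bucket instead of a pointer to its list member costs no extra address arithmetic.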
gro_list_prepare() always returns &napi->gro_hash[bucket].list,
without any variations. Moreover, it uses 'napi' argument only to
have access to this list, and calculates the bucket index for the
second time (the first time being at the beginning of
dev_gro_receive()) to do that.
Given that dev_gro_receive() already has an index to the needed
list, just pass it as the first argument to eliminate redundant
calculations, and make gro_list_prepare() return void.
Also, both arguments of gro_list_prepare() can be constified since
this function can only modify the skbs from the bucket list.
Signed-off-by: Alexander Lobakin <alobakin@pm.me> Signed-off-by: David S. Miller <davem@davemloft.net>
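The shape of the simplified call can be sketched as follows. This is a toy model under stated assumptions: HeldSkb is a hypothetical stand-in for a held packet, and only the flow hash is compared, whereas the real function checks more fields than that.

```python
from dataclasses import dataclass

GRO_HASH_BUCKETS = 8  # assumed bucket count


@dataclass
class HeldSkb:
    """Hypothetical stand-in for a packet held on a GRO bucket list."""
    flow_hash: int
    same_flow: bool = False


def gro_list_prepare(head: list, skb_hash: int) -> None:
    """Flag held packets that may belong to the new packet's flow.

    Takes the bucket list directly and returns nothing: the caller
    already knows which list it passed in.
    """
    for held in head:
        held.same_flow = held.flow_hash == skb_hash


buckets = [[] for _ in range(GRO_HASH_BUCKETS)]
buckets[3] = [HeldSkb(0x13), HeldSkb(0x23)]

new_hash = 0x23
bucket = new_hash & (GRO_HASH_BUCKETS - 1)   # computed once, by the caller
gro_list_prepare(buckets[bucket], new_hash)  # no second index calculation
```

Only the entries on the list are modified, which is why the list pointer and the new packet can both be treated as read-only by the helper.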
Shachar Raindel [Fri, 12 Mar 2021 23:45:27 +0000 (15:45 -0800)]
hv_netvsc: Add a comment clarifying batching logic
The batching logic in netvsc_send is non-trivial, due to
a combination of the Linux API and the underlying hypervisor
interface. Add a comment explaining why the code is written this
way.
Signed-off-by: Shachar Raindel <shacharr@microsoft.com> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 11 Mar 2021 10:32:53 +0000 (11:32 +0100)]
samples: pktgen: new append mode
To configure various complex flows we can certainly create custom
pktgen init scripts, but sometimes that's not that easy.
The new "-a" (append) option in all the existing sample scripts allows
appending more "devices" to pktgen threads.
The most straightforward use cases for that are:
- using multiple devices. We have to generate full linerate on
all physical functions (ports) of our multiport device.
- pushing multiple flows (with different packet options)
Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit cd0801e3d55f ("net: stmmac: Do not accept invalid MTU values")
started using the TX FIFO size to verify what counts as a valid MTU
request for the stmmac driver. This is unset for the ipq806x variant.
Looking at older patches for this, it seems the RX + TX buffers can be
up to 8k, so set it appropriately.
(I sent this as an RFC patch in June last year, but received no replies.
I've been running with this on my hardware (a MikroTik RB3011) since
then with larger MTUs to support both the internal qca8k switch and
VLANs with no problems. Without the patch it's impossible to set the
larger MTU required to support this.)
Signed-off-by: Jonathan McDowell <noodles@earth.li> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ethernet: marvell: Fixed typo in the file sky2.c
s/calclation/calculation/
Signed-off-by: Bhaskar Chowdhury <unixbhaskar@gmail.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Mar 2021 22:30:48 +0000 (14:30 -0800)]
Merge branch 'dsa-hewllcreek-dumps'
Kurt Kanzenbach says:
====================
net: dsa: hellcreek: Add support for dumping tables
Add support for dumping the VLAN and FDB tables via devlink. As the driver uses
internal VLANs and static FDB entries, this is a useful debugging feature.
Changes since v1:
* Drop memory reporting as there are better APIs to expose this
* Move comment to VLAN patch
Kurt Kanzenbach [Sat, 13 Mar 2021 09:39:39 +0000 (10:39 +0100)]
net: dsa: hellcreek: Add devlink FDB region
Allow dumping the FDB table via devlink. This is a useful debugging feature.
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kurt Kanzenbach [Sat, 13 Mar 2021 09:39:38 +0000 (10:39 +0100)]
net: dsa: hellcreek: Move common code to helper
There are two functions which need to populate FDB entries. Move that code to a
helper function.
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kurt Kanzenbach [Sat, 13 Mar 2021 09:39:37 +0000 (10:39 +0100)]
net: dsa: hellcreek: Use boolean value
hellcreek_select_vlan() takes a boolean instead of an integer.
So, use false accordingly.
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Kurt Kanzenbach [Sat, 13 Mar 2021 09:39:36 +0000 (10:39 +0100)]
net: dsa: hellcreek: Add devlink VLAN region
Allow dumping the VLAN table via devlink. This is especially useful, because the
driver internally leverages VLANs for the port separation. These are not visible
via the bridge utility.
Signed-off-by: Kurt Kanzenbach <kurt@kmk-computers.de> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Mar 2021 22:18:10 +0000 (14:18 -0800)]
Merge branch 'pps-policing'
Simon Horman says:
====================
net/sched: act_police: add support for packet-per-second policing
This series enhances the TC policer action implementation to allow a
policer action instance to enforce a rate-limit based on
packets-per-second, configurable using a packet-per-second rate and burst
parameters.
In the hope of aiding review this is broken up into three patches.
* [PATCH 1/3] flow_offload: add support for packet-per-second policing
Add support for this feature to the flow_offload API that is used to allow
programming flows, including TC rules and their actions, into hardware.
* [PATCH 2/3] flow_offload: reject configuration of packet-per-second policing in offload drivers
Teach all existing users of the flow_offload API that allow offload of
policer action instances to reject offload if packet-per-second rate
limiting is configured: none support it at this time.
* [PATCH 3/3] net/sched: act_police: add support for packet-per-second policing
With the above ground-work in place add the new feature to the TC policer
action itself.
With the above in place the feature may be used.
As follow-ups we plan to provide:
* Corresponding updates to iproute2
* Corresponding self tests (which depend on the iproute2 changes)
* Hardware offload support for the NFP driver
Key changes since v2:
* Added patches 1 and 2, which make adding patch 3 safe for existing
hardware offload of the policer action
* Re-worked patch 3 so that a TC policer action instance may be configured
for packet-per-second or byte-per-second rate limiting, but not both.
* Corrected kdoc usage
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Baowen Zheng [Fri, 12 Mar 2021 14:08:31 +0000 (15:08 +0100)]
net/sched: act_police: add support for packet-per-second policing
Allow a policer action to enforce a rate-limit based on packets-per-second,
configurable using a packet-per-second rate and burst parameters.
e.g.
tc filter add dev tap1 parent ffff: u32 match \
u32 0 0 police pkts_rate 3000 pkts_burst 1000
Testing was unable to uncover a performance impact of this change on
existing features.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
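The effect of the pkts_rate and pkts_burst parameters in the tc example can be illustrated with a plain token bucket that counts packets instead of bytes. This is a simplified model, not the act_police implementation: the PacketPolicer class is hypothetical, time is passed in explicitly, and one token stands for one packet.

```python
class PacketPolicer:
    """Toy packet-per-second token bucket (illustrative only)."""

    def __init__(self, pkts_rate: int, pkts_burst: int, now: float = 0.0):
        self.rate = pkts_rate             # tokens (packets) added per second
        self.burst = pkts_burst           # bucket depth in packets
        self.tokens = float(pkts_burst)   # start with a full bucket
        self.last = now

    def conform(self, now: float) -> bool:
        """Return True if a packet arriving at time 'now' is within the rate."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


policer = PacketPolicer(pkts_rate=3000, pkts_burst=1000)
# 1500 packets arrive back to back: only the burst allowance conforms.
burst_passed = sum(policer.conform(now=0.0) for _ in range(1500))
# 100 ms later the bucket has refilled by 3000 * 0.1 = 300 packets.
after_100ms = sum(policer.conform(now=0.1) for _ in range(1500))
```

Packet size plays no part in the decision, which is the difference from byte-per-second policing.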
Baowen Zheng [Fri, 12 Mar 2021 14:08:30 +0000 (15:08 +0100)]
flow_offload: reject configuration of packet-per-second policing in offload drivers
A follow-up patch will allow users to configure packet-per-second policing
in the software datapath. In preparation for this, teach all drivers that
support offload of the policer action to reject such configuration as
currently none of them support it.
Signed-off-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
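The kind of check each offload driver gains can be sketched as below. The PoliceAction type and driver_offload_police() function are hypothetical names for illustration. The only behaviour taken from the text is that a configured packet-per-second rate causes the offload request to be rejected; the specific error code is an assumption.

```python
import errno
from dataclasses import dataclass


@dataclass
class PoliceAction:
    """Hypothetical view of a policer action as seen by an offload driver."""
    rate_bytes_ps: int = 0
    rate_pkt_ps: int = 0  # 0 means packet-per-second policing is not configured


def driver_offload_police(act: PoliceAction) -> int:
    """Return 0 on success or a negative errno, kernel style."""
    if act.rate_pkt_ps:
        # Assumed error code: the hardware cannot police by packet count.
        return -errno.EOPNOTSUPP
    # ... program the byte-rate policer into hardware here ...
    return 0
```

Rejecting the request keeps the rule in the software datapath instead of silently offloading a policer with the wrong semantics.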
Xingfeng Hu [Fri, 12 Mar 2021 14:08:29 +0000 (15:08 +0100)]
flow_offload: add support for packet-per-second policing
Allow flow_offload API to configure packet-per-second policing using rate
and burst parameters.
Dummy implementations of tcf_police_rate_pkt_ps() and
tcf_police_burst_pkt() are supplied which return 0, the unconfigured state.
This is to facilitate splitting the offload, driver, and TC code portions of
this feature into separate patches with the aim of providing a logical flow
for review. The implementation of these helpers will be filled out by a
follow-up patch.
Signed-off-by: Xingfeng Hu <xingfeng.hu@corigine.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: Louis Peens <louis.peens@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang [Fri, 12 Mar 2021 08:50:16 +0000 (16:50 +0800)]
net: hns3: add phy loopback support for imp-controlled PHYs
If the imp-controlled PHYs feature is enabled, the driver can no longer
call the phy driver interface to set loopback and needs
to send a command to the firmware to start phy loopback.
The driver reuses the existing firmware command 0x0315 to start
phy loopback, just adding a setting bit to this command. As this
command is no longer only for serdes loopback, rename it
to "xxx_COMMON_LOOPBACK", and modify the function names,
macro names and logs related to it.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang [Fri, 12 Mar 2021 08:50:15 +0000 (16:50 +0800)]
net: hns3: add ioctl support for imp-controlled PHYs
When the imp-controlled PHYs feature is enabled, the driver will not
register an mdio bus. In order to support ioctl ops for phy tools to
read or write phy registers in this case, the firmware implements
a new command for the driver, and the driver implements ioctl by using this
new command.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang [Fri, 12 Mar 2021 08:50:14 +0000 (16:50 +0800)]
net: hns3: add get/set pause parameters support for imp-controlled PHYs
When the imp-controlled PHYs feature is enabled, phydev is NULL.
In this case, autoneg is always reported as off when the user runs the
ethtool -a command to get pause parameters, because hclge_get_pauseparam()
uses phydev to check whether the device is a TP port. To fit this new
feature, use the media type to check whether the device is a TP port.
And when the user sets pause parameters, these parameters always need to
be set on the MAC, no matter whether autoneg is off.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guangbin Huang [Fri, 12 Mar 2021 08:50:13 +0000 (16:50 +0800)]
net: hns3: add support for imp-controlled PHYs
The IMP (Intelligent Management Processor) firmware adds a new feature
to take control of PHYs for some new devices; the PF driver adds
support for this feature.
The driver queries the device's capability to check whether IMP supports
this feature and, if it is supported, tells IMP to enable it via the
firmware compatible command.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Mar 2021 01:50:42 +0000 (17:50 -0800)]
Merge branch 'sh_eth-reg-defs'
Sergey Shtylyov says:
====================
sh_eth: Improve the register/bit definitions in the Ether driver
Here are 4 patches against DaveM's 'net-next' repo. Mainly I'm renaming the register *enum*
tags/entries to match the SoC manuals, and also moving the RX-TX descriptor *enum*s closer to
the corresponding *struct*s...
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Shtylyov [Fri, 12 Mar 2021 20:47:02 +0000 (23:47 +0300)]
sh_eth: place RX/TX descriptor *enum*s after their *struct*s
Place the RX/TX descriptor bit *enum*s where they belong -- after the
corresponding RX/TX descriptor *struct*s and, while at it, switch to
declaring one *enum* entry per line...
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Shtylyov [Fri, 12 Mar 2021 20:44:53 +0000 (23:44 +0300)]
sh_eth: rename PSR bits
In all the SoC manuals (except R-Car gen2) the PHY status register's name
is abbreviated to PSR with the only valid bit 0 named LMON. Follow
suit and rename the corresponding *enum* tag/entry.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Sergey Shtylyov [Fri, 12 Mar 2021 20:43:46 +0000 (23:43 +0300)]
sh_eth: rename TRSCER bits
In all the SoC manuals the TRSCER register bits match the corresponding
EESR register's bits, but only on the R-Car gen2 SoC those are named
RINT<n> and TINT<n>. Follow suit and rename the *enum* tag/entries
from DESC_I_* to TRSCER_*.
Signed-off-by: Sergey Shtylyov <s.shtylyov@omprussia.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
mptcp: Include multiple address ids in RM_ADDR
Here's a patch series from the MPTCP tree that extends the capabilities
of the MPTCP RM_ADDR header.
MPTCP peers can exchange information about their IP addresses that are
available for additional MPTCP subflows. IP addresses are advertised
with an ADD_ADDR header type, and those advertisements are revoked with
the RM_ADDR header type. RFC 8684 allows the RM_ADDR header to include
more than one address ID, so multiple advertisements can be revoked in a
single header. Previous kernel versions have only used RM_ADDR with a
single address ID, so multiple removals required multiple packets.
Patches 1-4 plumb address id list structures around the MPTCP code,
where before only a single address ID was passed.
Patches 5-8 make use of the address lists at the path manager layer that
tracks available addresses for both peers.
Patches 9-11 update the selftests to cover the new use of RM_ADDR with
multiple address IDs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:21 +0000 (17:16 -0800)]
selftests: mptcp: add testcases for removing addrs
This patch added the testcases for removing a list of addresses. Used
the netlink to flush the addresses in the testcases.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:20 +0000 (17:16 -0800)]
selftests: mptcp: set addr id for removing testcases
The removing testcases can only delete the addresses from id 1, this
patch added the support for deleting the addresses from any id that user
set.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:19 +0000 (17:16 -0800)]
selftests: mptcp: add invert argument for chk_rm_nr
Some of the removing testcases used two zeros as arguments for chk_rm_nr
like this: chk_rm_nr 0 0. This doesn't mean that no RM_ADDR has been sent.
It only means that RM_ADDR had been sent in the opposite direction that
chk_rm_nr is checking.
This patch added a new argument invert for chk_rm_nr to allow it to
check the RM_ADDR from the opposite direction.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:18 +0000 (17:16 -0800)]
mptcp: remove a list of addrs when flushing
This patch invoked mptcp_nl_remove_addrs_list to remove a list of addresses
when the netlink flushes addresses, instead of using
mptcp_nl_remove_subflow_and_signal_addr to remove them one by one.
And dropped the unused parameter net in __flush_addrs too.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:17 +0000 (17:16 -0800)]
mptcp: remove multi addresses and subflows in PM
This patch implemented the function to remove a list of addresses and
subflows, named mptcp_nl_remove_addrs_list, which had an input parameter
rm_list as the removing addresses list.
In mptcp_nl_remove_addrs_list, traverse all the existing msk sockets to
invoke mptcp_pm_remove_addrs_and_subflows to remove a list of addresses
for each msk socket.
In mptcp_pm_remove_addrs_and_subflows, traverse all the addresses in the
removing addresses list, to find whether this address is in the conn_list
or anno_list. If it is, put the address ID into the removing address list
or the removing subflow list, and pass the two lists to
mptcp_pm_remove_addr and mptcp_pm_remove_subflow.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
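The sorting step described for mptcp_pm_remove_addrs_and_subflows can be sketched as below. The function name split_removals() and the plain sets standing in for conn_list and anno_list are hypothetical. The mapping of conn_list hits to the subflow list and anno_list hits to the address list is an assumption, since the text does not spell out which list feeds which.

```python
def split_removals(rm_ids, conn_ids, anno_ids):
    """Sort address IDs to remove into an address list and a subflow list.

    rm_ids   -- IDs requested for removal
    conn_ids -- IDs that currently back a subflow (stand-in for conn_list)
    anno_ids -- IDs that were announced to the peer (stand-in for anno_list)
    """
    addr_list, subflow_list = [], []
    for addr_id in rm_ids:
        if addr_id in conn_ids:
            subflow_list.append(addr_id)  # would go to mptcp_pm_remove_subflow
        if addr_id in anno_ids:
            addr_list.append(addr_id)     # would go to mptcp_pm_remove_addr
    return addr_list, subflow_list


addr_list, subflow_list = split_removals([1, 2, 3, 4], {2, 3}, {3, 4})
```

An ID found in neither list is simply skipped, and an ID found in both ends up in both output lists.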
Geliang Tang [Sat, 13 Mar 2021 01:16:16 +0000 (17:16 -0800)]
mptcp: remove multi subflows in PM
This patch dealt with removing multi subflows in PM:
In mptcp_pm_remove_subflow, changed the input parameter local_id to a
list of removing address ids, and passed the list to
mptcp_pm_nl_rm_subflow_received.
In mptcp_pm_nl_rm_subflow_received, iterated each address id from the
received ids list. Then shut down and closed each address id's subsocket.
In mptcp_nl_remove_subflow_and_signal_addr, put the single address id into
an ids list, and passed it to mptcp_pm_remove_subflow.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:15 +0000 (17:16 -0800)]
mptcp: remove multi addresses in PM
This patch dropped the member rm_id of struct mptcp_pm_data. Use
rm_list_rx in mptcp_pm_nl_rm_addr_received instead of using rm_id.
In mptcp_pm_nl_rm_addr_received, iterated each address id from
pm.rm_list_rx, then shut down and closed each address id's subsocket.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:14 +0000 (17:16 -0800)]
mptcp: add rm_list_rx in mptcp_pm_data
This patch added a new member rm_list_rx for struct mptcp_pm_data as a
list of the removing address ids on the incoming direction. Initialized
its nr field to zero in mptcp_pm_data_init.
In mptcp_pm_rm_addr_received, set it as the input rm_list.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:13 +0000 (17:16 -0800)]
mptcp: add rm_list in mptcp_options_received
This patch changed the member rm_id in struct mptcp_options_received as a
list of the removing address ids, and renamed it to rm_list.
In mptcp_parse_option, parsed the RM_ADDR suboption and filled them into
the rm_list in struct mptcp_options_received.
In mptcp_incoming_options, passed this rm_list to the function
mptcp_pm_rm_addr_received.
It also changed the parameter type of mptcp_pm_rm_addr_received.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:12 +0000 (17:16 -0800)]
mptcp: add rm_list_tx in mptcp_pm_data
This patch added a new member rm_list_tx for struct mptcp_pm_data as the
removing address list on the outgoing direction. Initialize its nr field
to zero in mptcp_pm_data_init.
In mptcp_pm_remove_anno_addr, put the single address id into a removing
list, and passed it to mptcp_pm_remove_addr.
In mptcp_pm_remove_addr, save the input rm_list to rm_list_tx in struct
mptcp_pm_data.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Geliang Tang [Sat, 13 Mar 2021 01:16:11 +0000 (17:16 -0800)]
mptcp: add rm_list in mptcp_out_options
This patch defined a new struct mptcp_rm_list, the ids field was an
array of the removing address ids, the nr field was the valid number of
removing address ids in the array. The array size was defined as a new
macro MPTCP_RM_IDS_MAX. Changed the member rm_id of struct
mptcp_out_options to rm_list.
In mptcp_established_options_rm_addr, invoked mptcp_pm_rm_addr_signal to
get the rm_list. According to the number of addresses in it, calculated
the padded RM_ADDR suboption length. And saved the ids array in struct
mptcp_out_options's rm_list member.
In mptcp_write_options, iterated each address id from struct
mptcp_out_options's rm_list member, set the invalid ones as TCPOPT_NOP,
then filled them into the RM_ADDR suboption.
Changed TCPOLEN_MPTCP_RM_ADDR_BASE from 4 to 3.
Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
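The length calculation and NOP padding described above can be sketched in user-space Python. This is an illustration of the on-wire layout, not the kernel code: the helper names are hypothetical, and the maximum of 8 IDs is an assumed value for MPTCP_RM_IDS_MAX. The base length of 3 covers the kind, length and subtype bytes, with every address ID counted on top of it.

```python
TCPOPT_MPTCP = 30               # TCP option kind used by MPTCP
TCPOPT_NOP = 1                  # single-byte padding option
MPTCPOPT_RM_ADDR = 4            # RM_ADDR subtype (upper nibble of third byte)
TCPOLEN_MPTCP_RM_ADDR_BASE = 3  # kind + length + subtype, no IDs
MPTCP_RM_IDS_MAX = 8            # assumed array size for this sketch


def rm_addr_len(nr: int) -> int:
    """Padded length of an RM_ADDR suboption carrying nr address IDs."""
    if not 1 <= nr <= MPTCP_RM_IDS_MAX:
        raise ValueError("invalid number of address IDs")
    unpadded = TCPOLEN_MPTCP_RM_ADDR_BASE + nr
    return (unpadded + 3) // 4 * 4  # round up to a 32-bit boundary


def build_rm_addr(ids: list) -> bytes:
    """Encode the suboption and fill unused ID slots with TCPOPT_NOP."""
    unpadded = TCPOLEN_MPTCP_RM_ADDR_BASE + len(ids)
    padded = rm_addr_len(len(ids))
    header = bytes([TCPOPT_MPTCP, unpadded, MPTCPOPT_RM_ADDR << 4])
    return header + bytes(ids) + bytes([TCPOPT_NOP]) * (padded - unpadded)


option = build_rm_addr([1, 2])
```

With a single ID the option is exactly 4 bytes, as before the change. Lowering the base from 4 to 3 simply stops counting that first ID as part of the fixed header.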
Support for resilient next-hop groups was added in a previous patch set.
Resilient next hop groups add a layer of indirection between the SKB hash
and the next hop. Thus the hash is used to reference a hash table bucket,
which is then used to reference a particular next hop. This allows the
system more flexibility when assigning SKB hash space to next hops.
Previously, each next hop had to be assigned a continuous range of SKB hash
space. With a hash table as an intermediate layer, it is possible to
reassign next hops with a hash table bucket granularity. In turn, this
mends issues with traffic flow redirection resulting from next hop removal
or adjustments in next-hop weights.
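The indirection can be modelled in a few lines of Python. This is a toy model under stated assumptions: ResilientGroup is a hypothetical class, all next hops have equal weight, and buckets are filled round-robin. It ignores timers and weights entirely.

```python
class ResilientGroup:
    """Toy model: SKB hash -> bucket -> next hop (illustrative only)."""

    def __init__(self, nexthops, num_buckets):
        # Equal weights assumed: spread next hops round-robin over the buckets.
        self.buckets = [nexthops[i % len(nexthops)] for i in range(num_buckets)]

    def lookup(self, skb_hash):
        return self.buckets[skb_hash % len(self.buckets)]

    def remove(self, gone, survivors):
        """Reassign only the buckets that pointed at the removed next hop."""
        moved = 0
        for i, nexthop in enumerate(self.buckets):
            if nexthop == gone:
                self.buckets[i] = survivors[moved % len(survivors)]
                moved += 1
        return moved


group = ResilientGroup(["nh1", "nh2", "nh3", "nh4"], num_buckets=16)
before = [group.lookup(h) for h in range(1000)]
moved = group.remove("nh2", ["nh1", "nh3", "nh4"])
after = [group.lookup(h) for h in range(1000)]

# Flows that were not using nh2 keep their next hop.
untouched_ok = all(a == b for a, b in zip(after, before) if b != "nh2")
disturbed = sum(a != b for a, b in zip(after, before))
```

Only the flows hashed to the removed next hop's buckets are redirected, which is the property the bucket granularity is meant to provide.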
This patch set introduces mock offloading of resilient next hop groups by
the netdevsim driver, and a suite of selftests.
- Patch #1 adds a netdevsim-specific lock to protect next-hop hashtable.
Previously, netdevsim relied on RTNL to maintain mutual exclusion.
- Patch #2 extracts a helper to make the following patches clearer.
- Patch #3 implements the support for offloading of resilient next-hop
groups.
- Patch #4 introduces a new debugfs interface to set activity on a selected
next-hop bucket. This simulates how HW can periodically report bucket
activity, and buckets thus marked are expected to be exempt from
migration to new next hops when the group changes.
- Patches #5 and #6 clean up the fib_nexthop selftests.
- Patches #7, #8 and #9 add tests for resilient next hop groups. Patch #7
adds resilient-hashing counterparts to fib_nexthops.sh. Patch #8 adds a
new traffic test for resilient next-hop groups. Patch #9 adds a new
traffic test for tunneling.
- Patch #10 actually leverages the netdevsim offload to implement a suite
of algorithmic tests that verify how and when buckets are migrated under
various simulated workload scenarios.
The overall plan is to contribute approximately the following patchsets:
1) Nexthop policy refactoring (already pushed)
2) Preparations for resilient next hop groups (already pushed)
3) Implementation of resilient next hop group (already pushed)
4) Netdevsim offload plus a suite of selftests (this patchset)
5) Preparations for mlxsw offload of resilient next-hop groups
6) mlxsw offload including selftests
Interested parties can look at the complete code at [2].
Ido Schimmel [Fri, 12 Mar 2021 16:50:26 +0000 (17:50 +0100)]
selftests: netdevsim: Add test for resilient nexthop groups offload API
Test various aspects of the resilient nexthop group offload API on top
of the netdevsim implementation. Both good and bad flows are tested.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Co-developed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:25 +0000 (17:50 +0100)]
selftests: forwarding: Add resilient multipath tunneling nexthop test
Add a resilient nexthop objects version of gre_multipath_nh.sh. Test
that both IPv4 and IPv6 overlays work with resilient nexthop groups
where the nexthops are two GRE tunnels.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:24 +0000 (17:50 +0100)]
selftests: forwarding: Add resilient hashing test
Verify that IPv4 and IPv6 multipath forwarding works correctly with
resilient nexthop groups and with different weights.
Test that when the idle timer is not zero, the resilient groups are not
rebalanced - because the nexthop buckets are considered active - and the
initial weights (1:1) are used.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:23 +0000 (17:50 +0100)]
selftests: fib_nexthops: Test resilient nexthop groups
Add test cases for resilient nexthop groups. Exhaustive forwarding tests
are added separately under net/forwarding/.
Examples:
# ./fib_nexthops.sh -t basic_res
Basic resilient nexthop group functional tests
----------------------------------------------
TEST: Add a nexthop group with default parameters [ OK ]
TEST: Get a nexthop group with default parameters [ OK ]
TEST: Get a nexthop group with non-default parameters [ OK ]
TEST: Add a nexthop group with 0 buckets [ OK ]
TEST: Replace nexthop group parameters [ OK ]
TEST: Get a nexthop group after replacing parameters [ OK ]
TEST: Replace idle timer [ OK ]
TEST: Get a nexthop group after replacing idle timer [ OK ]
TEST: Replace unbalanced timer [ OK ]
TEST: Get a nexthop group after replacing unbalanced timer [ OK ]
TEST: Replace with no parameters [ OK ]
TEST: Get a nexthop group after replacing no parameters [ OK ]
TEST: Replace nexthop group type - implicit [ OK ]
TEST: Replace nexthop group type - explicit [ OK ]
TEST: Replace number of nexthop buckets [ OK ]
TEST: Get a nexthop group after replacing with invalid parameters [ OK ]
TEST: Dump all nexthop buckets [ OK ]
TEST: Dump all nexthop buckets in a group [ OK ]
TEST: Dump all nexthop buckets with a specific nexthop device [ OK ]
TEST: Dump all nexthop buckets with a specific nexthop identifier [ OK ]
TEST: Dump all nexthop buckets in a non-existent group [ OK ]
TEST: Dump all nexthop buckets in a non-resilient group [ OK ]
TEST: Dump all nexthop buckets using a non-existent device [ OK ]
TEST: Dump all nexthop buckets with invalid 'groups' keyword [ OK ]
TEST: Dump all nexthop buckets with invalid 'fdb' keyword [ OK ]
TEST: Get a valid nexthop bucket [ OK ]
TEST: Get a nexthop bucket with valid group, but invalid index [ OK ]
TEST: Get a nexthop bucket from a non-resilient group [ OK ]
TEST: Get a nexthop bucket from a non-existent group [ OK ]
Tests passed: 29
Tests failed: 0
# ./fib_nexthops.sh -t ipv4_large_res_grp
IPv4 large resilient group (128k buckets)
-----------------------------------------
TEST: Dump large (x131072) nexthop buckets [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_large_res_grp
IPv6 large resilient group (128k buckets)
-----------------------------------------
TEST: Dump large (x131072) nexthop buckets [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv4_res_torture
IPv4 runtime resilient nexthop group torture
--------------------------------------------
TEST: IPv4 resilient nexthop group torture test [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_res_torture
IPv6 runtime resilient nexthop group torture
--------------------------------------------
TEST: IPv6 resilient nexthop group torture test [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv4_res_grp_fcnal
IPv4 resilient groups functional
--------------------------------
TEST: Nexthop group updated when entry is deleted [ OK ]
TEST: Nexthop buckets updated when entry is deleted [ OK ]
TEST: Nexthop group updated after replace [ OK ]
TEST: Nexthop buckets updated after replace [ OK ]
TEST: Nexthop group updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop buckets updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop group updated after replace - nECMP [ OK ]
TEST: Nexthop buckets updated after replace - nECMP [ OK ]
Tests passed: 8
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_res_grp_fcnal
IPv6 resilient groups functional
--------------------------------
TEST: Nexthop group updated when entry is deleted [ OK ]
TEST: Nexthop buckets updated when entry is deleted [ OK ]
TEST: Nexthop group updated after replace [ OK ]
TEST: Nexthop buckets updated after replace [ OK ]
TEST: Nexthop group updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop buckets updated when entry is deleted - nECMP [ OK ]
TEST: Nexthop group updated after replace - nECMP [ OK ]
TEST: Nexthop buckets updated after replace - nECMP [ OK ]
Tests passed: 8
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Co-developed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:22 +0000 (17:50 +0100)]
selftests: fib_nexthops: List each test case in a different line
The lines with the IPv4 and IPv6 test cases are already very long and
more test cases will be added in subsequent patches.
List each test case in a different line to make it easier to extend the
test with more test cases.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:21 +0000 (17:50 +0100)]
selftests: fib_nexthops: Declutter test output
Before:
# ./fib_nexthops.sh -t ipv4_torture
IPv4 runtime torture
--------------------
TEST: IPv4 torture test [ OK ]
./fib_nexthops.sh: line 213: 19376 Killed ipv4_del_add_loop1
./fib_nexthops.sh: line 213: 19377 Killed ipv4_grp_replace_loop
./fib_nexthops.sh: line 213: 19378 Killed ip netns exec me ping -f 172.16.101.1 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 19380 Killed ip netns exec me ping -f 172.16.101.2 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 19381 Killed ip netns exec me mausezahn veth1 -B 172.16.101.2 -A 172.16.1.1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_torture
IPv6 runtime torture
--------------------
TEST: IPv6 torture test [ OK ]
./fib_nexthops.sh: line 213: 24453 Killed ipv6_del_add_loop1
./fib_nexthops.sh: line 213: 24454 Killed ipv6_grp_replace_loop
./fib_nexthops.sh: line 213: 24456 Killed ip netns exec me ping -f 2001:db8:101::1 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 24457 Killed ip netns exec me ping -f 2001:db8:101::2 > /dev/null 2>&1
./fib_nexthops.sh: line 213: 24458 Killed ip netns exec me mausezahn -6 veth1 -B 2001:db8:101::2 -A 2001:db8:91::1 -c 0 -t tcp "dp=1-1023, flags=syn" > /dev/null 2>&1
Tests passed: 1
Tests failed: 0
After:
# ./fib_nexthops.sh -t ipv4_torture
IPv4 runtime torture
--------------------
TEST: IPv4 torture test [ OK ]
Tests passed: 1
Tests failed: 0
# ./fib_nexthops.sh -t ipv6_torture
IPv6 runtime torture
--------------------
TEST: IPv6 torture test [ OK ]
Tests passed: 1
Tests failed: 0
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:20 +0000 (17:50 +0100)]
netdevsim: Allow reporting activity on nexthop buckets
A key component of the resilient hashing algorithm is the hash buckets'
activity. If a bucket is active, it will not be populated with a new
nexthop in order not to break existing flows. Therefore, in order to
easily and thoroughly test the algorithm, we need to be in full control
over the reported activity.
Add a debugfs interface that allows user space to have netdevsim report
a nexthop bucket within a resilient nexthop group as active. For
example:
Will mark bucket 23 in nexthop group 10 as active.
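The role of bucket activity can be illustrated with a small user-space
model. This is only a sketch under stated assumptions: the class and
method names below are hypothetical and do not correspond to netdevsim or
kernel code, and the migration policy is reduced to "idle buckets move,
active buckets stay".

```python
from dataclasses import dataclass, field


@dataclass
class ResilientGroup:
    """Toy model of a resilient nexthop group (hypothetical, not kernel code)."""
    buckets: list                               # buckets[i] = nexthop id of bucket i
    active: set = field(default_factory=set)    # bucket indices reported as active

    def mark_active(self, index: int) -> None:
        # Stand-in for an activity report such as the one the debugfs
        # interface lets user space inject.
        self.active.add(index)

    def migrate_idle_buckets(self, old_nh: int, new_nh: int) -> list:
        """Move idle buckets from old_nh to new_nh; return migrated indices."""
        migrated = []
        for i, nh in enumerate(self.buckets):
            if nh == old_nh and i not in self.active:
                self.buckets[i] = new_nh
                migrated.append(i)
        return migrated


group = ResilientGroup(buckets=[1, 1, 2, 2])
group.mark_active(1)                       # bucket 1 carries a live flow
print(group.migrate_idle_buckets(old_nh=1, new_nh=3), group.buckets)
```

In this model only bucket 0 is repopulated; bucket 1 keeps its nexthop
because it was reported active, which is the behavior a test needs to be
able to force.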
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:19 +0000 (17:50 +0100)]
netdevsim: Add support for resilient nexthop groups
Allow resilient nexthop groups to be programmed and account their
occupancy according to their number of buckets. The nexthop group itself
as well as its buckets are marked with hardware flags (i.e.,
'RTNH_F_TRAP').
Replacement of a single nexthop bucket can fail using the following
debugfs knob:
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
N
# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_nexthop_bucket_replace
Y
Replacement of a resilient nexthop group can fail using the following
debugfs knob:
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
N
# echo 1 > /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
# cat /sys/kernel/debug/netdevsim/netdevsim10/fib/fail_res_nexthop_group_replace
Y
This enables testing of various error paths.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Fri, 12 Mar 2021 16:50:18 +0000 (17:50 +0100)]
netdevsim: Create a helper for setting nexthop hardware flags
Instead of calling nexthop_set_hw_flags(), call a helper. It will be
used to also set nexthop bucket flags in a subsequent patch.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petr Machata [Fri, 12 Mar 2021 16:50:17 +0000 (17:50 +0100)]
netdevsim: fib: Introduce a lock to guard nexthop hashtable
Currently netdevsim relies on RTNL to maintain exclusivity in accessing the
nexthop hash table. However, bucket notification may be called without RTNL
having been held. Instead, introduce a custom lock to guard the table.
Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sat, 13 Mar 2021 01:09:34 +0000 (17:09 -0800)]
Merge branch 'ptp-warnings'
Lee Jones says:
====================
Rid W=1 warnings from PTP
This set is part of a larger effort attempting to clean-up W=1
kernel builds, which are currently overwhelmingly riddled with
niggly little warnings.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Lee Jones [Fri, 12 Mar 2021 11:09:10 +0000 (11:09 +0000)]
ptp: ptp_p: Demote non-conformant kernel-doc headers and supply a param description
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'control' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'event' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'addend' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'accum' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'test' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_compare' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rsystime_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rsystime_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'systime_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'systime_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'trgt_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'trgt_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'asms_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'asms_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'amms_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'amms_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ch_control' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ch_event' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'tx_snap_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'tx_snap_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rx_snap_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'rx_snap_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'src_uuid_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'src_uuid_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_status' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_snap_lo' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'can_snap_hi' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_sel' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'ts_st' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'reserve1' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'stl_max_set_en' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'stl_max_set' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'reserve2' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:78: warning: Function parameter or member 'srst' not described in 'pch_ts_regs'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'regs' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'ptp_clock' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'caps' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'exts0_enabled' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'exts1_enabled' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'mem_base' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'mem_size' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'irq' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'pdev' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:121: warning: Function parameter or member 'register_lock' not described in 'pch_dev'
drivers/ptp/ptp_pch.c:128: warning: Function parameter or member 'station' not described in 'pch_params'
drivers/ptp/ptp_pch.c:291: warning: Function parameter or member 'pdev' not described in 'pch_set_station_address'
Cc: Richard Cochran <richardcochran@gmail.com> Cc: LAPIS SEMICONDUCTOR <tshimizu818@gmail.com> Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Lee Jones [Fri, 12 Mar 2021 11:09:09 +0000 (11:09 +0000)]
ptp: ptp_clockmatrix: Demote non-kernel-doc header to standard comment
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
drivers/ptp/ptp_clockmatrix.c:1408: warning: Cannot understand * @brief Maximum absolute value for write phase offset in picoseconds
Cc: Richard Cochran <richardcochran@gmail.com> Cc: IDT-support-1588@lm.renesas.com Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Lee Jones [Fri, 12 Mar 2021 11:09:08 +0000 (11:09 +0000)]
ptp_pch: Move 'pch_*()' prototypes to shared header
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_pch.c:193:6: warning: no previous prototype for ‘pch_ch_control_write’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:201:5: warning: no previous prototype for ‘pch_ch_event_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:212:6: warning: no previous prototype for ‘pch_ch_event_write’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:220:5: warning: no previous prototype for ‘pch_src_uuid_lo_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:231:5: warning: no previous prototype for ‘pch_src_uuid_hi_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:242:5: warning: no previous prototype for ‘pch_rx_snap_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:259:5: warning: no previous prototype for ‘pch_tx_snap_read’ [-Wmissing-prototypes]
drivers/ptp/ptp_pch.c:300:5: warning: no previous prototype for ‘pch_set_station_address’ [-Wmissing-prototypes]
Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT) Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Flavio Suligoi <f.suligoi@asem.it> Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Lee Jones [Fri, 12 Mar 2021 11:09:07 +0000 (11:09 +0000)]
ptp_pch: Remove unused function 'pch_ch_control_read()'
Fixes the following W=1 kernel build warning(s):
drivers/ptp/ptp_pch.c:182:5: warning: no previous prototype for ‘pch_ch_control_read’ [-Wmissing-prototypes]
Cc: Richard Cochran <richardcochran@gmail.com> (maintainer:PTP HARDWARE CLOCK SUPPORT) Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: Flavio Suligoi <f.suligoi@asem.it> Cc: netdev@vger.kernel.org Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
On some SoCs (e.g. BCM4908, BCM631[345]8) SF2 has an integrated
crossbar. It allows connecting its selected external ports to internal
ports. It's used by vendors to handle custom Ethernet setups.
BCM4908 has the following 3x2 crossbar. On Asus GT-AC5300 rgmii is used for
connecting external BCM53134S switch. GPHY4 is usually used for WAN
port. More fancy devices use SerDes for 2.5 Gbps Ethernet.
Use setup data based on DT info to configure BCM4908's switch port 7.
Right now only GPHY and rgmii variants are supported. Handling SerDes
can be implemented later.
Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Rafał Miłecki [Thu, 11 Mar 2021 12:35:21 +0000 (13:35 +0100)]
net: broadcom: bcm4908_enet: support TX interrupt
It appears that each DMA channel has its own interrupt and both rings
can be configured (the same way) to handle interrupts.
1. Make ring interrupts code generic (make it operate on given ring)
2. Move napi to ring (so each has its own)
3. Make IRQ handler generic (match ring against received IRQ number)
4. Add (optional) support for TX interrupt
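The restructuring listed above can be sketched in user space as follows.
This is an illustrative model only: the Ring class, the IRQ numbers and
the handle_irq() helper are assumptions made for the example and are not
the driver's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class Ring:
    """One DMA ring with its own IRQ and its own NAPI state (hypothetical model)."""
    name: str
    irq: int
    napi_scheduled: bool = False


def handle_irq(rings, irq):
    """Generic handler: find the ring that owns 'irq' and schedule its NAPI."""
    for ring in rings:
        if ring.irq == irq:
            ring.napi_scheduled = True
            return ring
    return None  # the IRQ does not belong to any known ring


rx = Ring("rx", irq=40)   # IRQ numbers are made up for the example
tx = Ring("tx", irq=41)   # the TX interrupt is optional in the real driver
print(handle_irq([rx, tx], 41).name)
```

Keeping the NAPI state per ring means the same handler serves both
directions, and a device without a TX interrupt simply has one ring fewer
to match against.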
Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 11 Mar 2021 20:18:13 +0000 (14:18 -0600)]
net: macb: Disable PCS auto-negotiation for SGMII fixed-link mode
When using a fixed-link configuration in SGMII mode, it's not really
sensible to have auto-negotiation enabled since the link settings are
fixed by definition. In other configurations, such as an SGMII
connection to a PHY, it should generally be enabled.
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Robert Hancock [Thu, 11 Mar 2021 20:18:12 +0000 (14:18 -0600)]
net: macb: poll for fixed link state in SGMII mode
When using a fixed-link configuration with GEM in SGMII mode, such as
for a chip-to-chip interconnect, the link state was always showing as
established regardless of the actual connectivity state. We can monitor
the pcs_link_state bit in the Network Status register to determine
whether the PCS link state is actually up.
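A minimal sketch of the polling idea, assuming a callable that returns
the Network Status register value. The bit position used here is
hypothetical and chosen only for illustration; the real position is
defined by the hardware documentation and the driver headers.

```python
# Hypothetical bit position for pcs_link_state, for illustration only.
PCS_LINK_STATE = 1 << 16


def fixed_link_is_up(read_network_status):
    """Report link state from the status register instead of assuming 'up'."""
    return bool(read_network_status() & PCS_LINK_STATE)


print(fixed_link_is_up(lambda: PCS_LINK_STATE), fixed_link_is_up(lambda: 0))
```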
Signed-off-by: Robert Hancock <robert.hancock@calian.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paul Blakey [Mon, 21 Sep 2020 08:49:26 +0000 (11:49 +0300)]
net/mlx5: CT: Add support for mirroring
Add support for mirroring before the CT action by splitting the pre ct rule.
Mirror outputs are done first on the tc chain,prio table rule (the fwd
rule), which will then forward to a per port fwd table.
On this fwd table, we insert the original pre ct rule that forwards to
ct/ct nat table.
Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Maor Dickman <maord@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Alaa Hleihel [Thu, 31 Dec 2020 14:24:51 +0000 (16:24 +0200)]
net/mlx5: Display the command index in command mailbox dump
Multiple commands can be printed at the same time, which can
lead to their lines appearing in the wrong order in dmesg output.
As a result, it's hard to match data dumps to the correct command
or to tell which command was fully dumped at some point.
Fix this by displaying the corresponding command index, and also
indicate when a command was fully dumped.
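Why a per-line index helps can be shown with a small model of two
interleaved dumps. The line format below is invented for the example and
is not the driver's actual dmesg format.

```python
def dump_command(cmd_index, words, words_per_line=4):
    """Return dump lines for one command, each tagged with the command index."""
    lines = []
    for offset in range(0, len(words), words_per_line):
        chunk = " ".join(f"{w:08x}" for w in words[offset:offset + words_per_line])
        lines.append(f"cmd[{cmd_index}]: {offset:03x}: {chunk}")
    # Explicit end marker so a reader can tell the dump was not cut short.
    lines.append(f"cmd[{cmd_index}]: dump complete")
    return lines


def lines_for(cmd_index, log):
    """Recover one command's dump from an interleaved log."""
    prefix = f"cmd[{cmd_index}]:"
    return [line for line in log if line.startswith(prefix)]


a = dump_command(3, [1, 2, 3, 4, 5])
b = dump_command(7, [9])
interleaved = [a[0], b[0], a[1], b[1], a[2]]
print("\n".join(lines_for(3, interleaved)))
```

Without the index, the interleaved log above could not be split back
into the two original dumps.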
Increasing the size of the indirection_rqt array from 128 to 256 bytes
pushed the stack usage of the mlx5e_hairpin_fill_rqt_rqns() function
over the warning limit when building with clang and CONFIG_KASAN:
drivers/net/ethernet/mellanox/mlx5/core/en_tc.c:970:1: error: stack frame size of 1180 bytes in function 'mlx5e_tc_add_nic_flow' [-Werror,-Wframe-larger-than=]
Using dynamic allocation here is safe because the caller does the
same, and it reduces the stack usage of the function to just a few
bytes.
net/mlx5e: Use net_prefetchw instead of prefetchw in MPWQE TX datapath
Commit 96239817ce92 ("net/mlx5e: RX, Add a prefetch command for small
L1_CACHE_BYTES") switched to using net_prefetchw at all places in mlx5e.
In the same time frame, commit 8fd8b87182ff ("net/mlx5e: Enhanced TX
MPWQE for SKBs") added one more usage of prefetchw. When these two
changes were merged, this new occurrence of prefetchw wasn't replaced
with net_prefetchw.
This commit fixes this last occurrence of prefetchw in
mlx5e_tx_mpwqe_session_start, making the same change that was done in
mlx5e_xdp_mpwqe_session_start.
Roi Dayan [Tue, 9 Mar 2021 16:25:59 +0000 (18:25 +0200)]
net/mlx5e: Remove redundant newline in NL_SET_ERR_MSG_MOD
Fix the following coccicheck warnings:
drivers/net/ethernet/mellanox/mlx5/core/devlink.c:145:29-66: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
drivers/net/ethernet/mellanox/mlx5/core/devlink.c:140:29-77: WARNING
avoid newline at end of message in NL_SET_ERR_MSG_MOD
Signed-off-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
docs: networking: phy: Improve placement of parenthesis
"either" is outside the parentheses, so the matching "or" should be too.
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 12 Mar 2021 02:35:31 +0000 (18:35 -0800)]
Merge branch 'tcp-delayed-completions'
Eric Dumazet says:
====================
tcp: better deal with delayed TX completions
Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters).
While the problems have been there forever, the second patch might
introduce some regressions, so I prefer not to backport
them to stable releases before things settle.
Many thanks to FB team for their help and tests.
Few packetdrill tests need to be changed to reflect
the improvements brought by this series.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Mar 2021 20:35:06 +0000 (12:35 -0800)]
tcp: remove obsolete check in __tcp_retransmit_skb()
TSQ provides a nice way to avoid bufferbloat on individual sockets,
including for retransmitted packets. We can get rid of the old
heuristic:
    /* Do not sent more than we queued. 1/4 is reserved for possible
     * copying overhead: fragmentation, tunneling, mangling etc.
     */
    if (refcount_read(&sk->sk_wmem_alloc) >
        min_t(u32, sk->sk_wmem_queued + (sk->sk_wmem_queued >> 2),
              sk->sk_sndbuf))
            return -EAGAIN;
According to Jakub, this heuristic was giving false positives
whenever TX completions are delayed beyond the RTT. (ACK packets
are processed by the TCP stack before clones are orphaned/freed.)
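The arithmetic of the removed check can be worked through in user space.
This is a sketch only: the function name and the byte counts are
assumptions chosen to illustrate the false positive, and the u32
truncation done by min_t() is ignored.

```python
import errno


def old_retransmit_check(wmem_alloc, wmem_queued, sndbuf):
    """Model of the removed heuristic: allow queued bytes plus 1/4 overhead."""
    limit = min(wmem_queued + (wmem_queued >> 2), sndbuf)
    return -errno.EAGAIN if wmem_alloc > limit else 0


# ACKs have already shrunk the write queue to 40000 bytes, but clones still
# held for TX completion keep 60000 bytes accounted in wmem_alloc.
# limit = min(40000 + 10000, 200000) = 50000, so the retransmit is refused.
print(old_retransmit_check(wmem_alloc=60000, wmem_queued=40000, sndbuf=200000))

# Once the clones are freed, the same retransmit passes the check.
print(old_retransmit_check(wmem_alloc=45000, wmem_queued=40000, sndbuf=200000))
```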
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Mar 2021 20:35:05 +0000 (12:35 -0800)]
tcp: consider using standard rtx logic in tcp_rcv_fastopen_synack()
Jakub reported that data included in a Fastopen SYN that had to be
retransmitted would have to wait for an RTO if TX completions are slow,
even with the prior fix.
This is because tcp_rcv_fastopen_synack() does not use standard
rtx logic, meaning the TSQ handler exits early in tcp_tsq_write()
because tp->lost_out == tp->retrans_out.
Let's make tcp_rcv_fastopen_synack() use standard rtx logic,
by using tcp_mark_skb_lost() on the skb that needs to be
sent again.
Note that this raised a warning in tcp_fastretrans_alert() during my
tests, since we consider that data not being acknowledged
by the receiver does not mean the packet was lost on the network.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 11 Mar 2021 20:35:04 +0000 (12:35 -0800)]
tcp: plug skb_still_in_host_queue() to TSQ
Jakub and Neil reported an increase of RTO timers whenever
TX completions are delayed a bit more (by increasing
NIC TX coalescing parameters).
The main issue is that the TCP stack has logic preventing a packet
from being retransmitted if the prior clone has not yet been
orphaned or freed.
This logic came with commit c264c97e44f8 ("tcp: avoid
retransmits of TCP packets hanging in host queues")
Thankfully, in the case skb_still_in_host_queue() detects
the initial clone is still in flight, it can use TSQ logic
that will eventually retry later, at the moment the clone
is freed or orphaned.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Neil Spring <ntspring@fb.com> Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hoang Le [Thu, 11 Mar 2021 03:33:22 +0000 (10:33 +0700)]
tipc: convert dest node's address to network order
(struct tipc_link_info)->dest is in network order (__be32), so we must
convert the value to network order before assigning. The problem detected
by sparse:
net/tipc/netlink_compat.c:699:24: warning: incorrect type in assignment (different base types)
net/tipc/netlink_compat.c:699:24: expected restricted __be32 [usertype] dest
net/tipc/netlink_compat.c:699:24: got int
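The effect of the missing conversion can be reproduced in user space.
This is a sketch, assuming an arbitrary example node address; the helper
names are hypothetical and are not part of the TIPC code.

```python
import socket
import struct


def dest_to_network_order(node_addr):
    """Convert a host-order 32-bit node address to network (big-endian) order."""
    return socket.htonl(node_addr)


def as_stored_bytes(value):
    """Bytes that end up in memory when 'value' is stored in a 32-bit field."""
    return struct.pack("=I", value)


addr = 0x01001001  # arbitrary example value
# After conversion the stored bytes are big-endian on any host.
print(as_stored_bytes(dest_to_network_order(addr)).hex())
```

On a little-endian host, storing the unconverted value would place the
bytes in reverse order, which is what the sparse type mismatch is
pointing at.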
Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
mlxsw: Implement sampling using mirroring
So far, sampling was implemented using a dedicated sampling mechanism
that is available on all Spectrum ASICs. Spectrum-2 and later ASICs
support sampling by mirroring packets to the CPU port with probability.
This method has a couple of advantages compared to the legacy method:
* Extra metadata per-packet: Egress port, egress traffic class, traffic
class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow as opposed to only
ingress
This series should not result in any user-visible changes and its aim is
to convert Spectrum-2 and later ASICs to perform sampling by mirroring
to the CPU port with probability. Future submissions will expose the
additional metadata and enable sampling using more triggers (e.g.,
egress).
Series overview:
Patches #1-#3 extend the SPAN (mirroring) module to accept new
parameters required for sampling. See individual commit messages for
detailed explanation.
Patches #4-#5 split sampling support between Spectrum-1 and later ASICs
while still using the legacy method for all ASIC generations.
Patch #6 converts Spectrum-2 and later ASICs to perform sampling by
mirroring to the CPU port with probability.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 11 Mar 2021 12:24:16 +0000 (14:24 +0200)]
mlxsw: spectrum_matchall: Implement sampling using mirroring
Spectrum-2 and later ASICs support sampling of packets by mirroring to
the CPU with probability. There are several advantages compared to the
legacy dedicated sampling mechanism:
* Extra metadata per-packet: Egress port, egress traffic class, traffic
class occupancy and end-to-end latency
* Ability to sample packets on egress / per-flow
Convert Spectrum-2 and later ASICs to perform sampling by mirroring to
the CPU with probability.
Subsequent patches will add support for egress / per-flow sampling and
expose the extra metadata.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 11 Mar 2021 12:24:15 +0000 (14:24 +0200)]
mlxsw: spectrum_trap: Split sampling traps between ASICs
Sampling of ingress packets is supported using a dedicated sampling
mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
support more sophisticated sampling by mirroring packets to the CPU.
As a preparation for more advanced sampling configurations, split the trap
configuration used for sampled packets between Spectrum-1 and later ASICs.
This is needed since packets that are mirrored to the CPU are trapped
via a different trap identifier compared to packets that are sampled
using the dedicated sampling mechanism.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 11 Mar 2021 12:24:14 +0000 (14:24 +0200)]
mlxsw: spectrum_matchall: Split sampling support between ASICs
Sampling of ingress packets is supported using a dedicated sampling
mechanism on all Spectrum ASICs. However, Spectrum-2 and later ASICs
support more sophisticated sampling by mirroring packets to the CPU.
As a preparation for more advanced sampling configurations, split the
sampling operations between Spectrum-1 and later ASICs.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 11 Mar 2021 12:24:13 +0000 (14:24 +0200)]
mlxsw: spectrum_span: Add SPAN probability rate support
Currently, every packet that matches a mirroring trigger (e.g., received
packets, buffer dropped packets) is mirrored. Spectrum-2 and later ASICs
support mirroring with probability, where every 1 in N matched packets
is mirrored.
Extend the API that creates the binding between the trigger and the SPAN
agent with a probability rate parameter, which is an attribute of the
trigger. Set it to '1' to maintain existing behavior.
Subsequent patches will use it to perform more sophisticated sampling,
by mirroring packets to the CPU with probability.
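The meaning of the probability rate can be modeled as follows. This is a
sketch under stated assumptions: the function name and the use of a
software random number generator are illustrative only, since the real
selection is done by the ASIC.

```python
import random


def mirror_with_rate(packets, rate, rng):
    """Mirror each matched packet with probability 1/rate."""
    if rate < 1:
        raise ValueError("rate must be at least 1")
    # randrange(rate) is 0 with probability 1/rate; for rate == 1 it is always 0.
    return [p for p in packets if rng.randrange(rate) == 0]


rng = random.Random(0)
packets = list(range(100000))
print(len(mirror_with_rate(packets, 1, rng)))     # rate 1 keeps every packet
print(len(mirror_with_rate(packets, 100, rng)))   # roughly 1 in 100
```

A rate of 1 selects every matched packet, which is why passing '1'
preserves the existing mirroring behavior.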
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 11 Mar 2021 12:24:12 +0000 (14:24 +0200)]
mlxsw: reg: Extend mirroring registers with probability rate field
The MPAR and MPAGR registers are used to configure the binding between
the mirroring trigger (e.g., received packet) and the SPAN agent. Add
probability rate field, which will allow us to support sampling by
mirroring to the CPU.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ido Schimmel [Thu, 11 Mar 2021 12:24:11 +0000 (14:24 +0200)]
mlxsw: spectrum_span: Add SPAN session identifier support
When packets are mirrored to the CPU, the trap identifier with which the
packets are trapped is determined according to the session identifier of
the SPAN agent performing the mirroring. Packets that are trapped for
the same logical reason (e.g., buffer drops) should use the same session
identifier.
Currently, a single session is implicitly supported (identifier 0) and
is used for packets that are mirrored to the CPU due to buffer drops
(e.g., early drop).
Subsequent patches are going to mirror packets to the CPU due to
sampling, which will require a different session identifier.
Prepare for that by making the session identifier an attribute of the
SPAN agent.
Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 12 Mar 2021 00:17:43 +0000 (16:17 -0800)]
Merge tag 'mlx5-updates-2021-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
This series provides some cleanups to mlx5 driver
For more information please see tag log below.
Please pull and let me know if there is any problem.
mlx5-updates-2021-03-11
Cleanups for mlx5 driver
1) Fix build warnings from Arnd and Vlad
2) Leon improves locking for driver load/unload flows
3) From Roi, Lockdep false dependency warning
4) Other trivial cleanups
====================
Signed-off-by: David S. Miller <davem@davemloft.net>