git.baikalelectronics.ru Git

net: dsa: implement a central TX reallocation procedure

At the moment, taggers are left with the task of ensuring that the skb
headers are writable (which they aren't, if the frames were cloned for
TX timestamping, for flooding by the bridge, etc), and that there is
enough space in the skb data area for the DSA tag to be pushed.

Moreover, the life of tail taggers is even harder, because they need to
ensure that short frames have enough padding, a problem that normal
taggers don't have.

The principle of the DSA framework is that everything except for the
most intimate hardware specifics (like in this case, the actual packing
of the DSA tag bits) should be done inside the core, to avoid having
code paths that are very rarely tested.

So provide a TX reallocation procedure that should cover the known needs
of DSA today.

Note that this patch also gives the network stack a good hint about the
headroom/tailroom it's going to need. Up till now it wasn't doing that.
So the reallocation procedure should really be there only for the
exceptional cases, and for cloned packets which need to be unshared.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Tested-by: Christian Eggers <ceggers@arri.de> # For tail taggers only
Tested-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

openvswitch: Use IS_ERR instead of IS_ERR_OR_NULL

Fix smatch warning:

net/openvswitch/meter.c:427 ovs_meter_cmd_set() warn: passing zero to 'PTR_ERR'

dp_meter_create() never returns NULL, use IS_ERR
instead of IS_ERR_OR_NULL to fix this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Link: https://lore.kernel.org/r/20201031060153.39912-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: Remove duplicated include

Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20201031024940.29716-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

liquidio: cn68xx: Remove duplicated include

Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20201031024744.39020-1-yuehaibing@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: avoid slow start during fast recovery on new losses

During TCP fast recovery, the congestion control in charge is by
default the Proportional Rate Reduction (PRR) unless the congestion
control module specified otherwise (e.g. BBR).

Previously when tcp_packets_in_flight() is below snd_ssthresh PRR
would slow start upon receiving an ACK that
   1) cumulatively acknowledges retransmitted data
   and
   2) does not detect further lost retransmission

Such conditions indicate the repair is in good steady progress
after the first round trip of recovery. Otherwise PRR adopts the
packet conservation principle to send only the amount that was
newly delivered (indicated by this ACK).

This patch generalizes the previous design principle to include
also the newly sent data beside retransmission: as long as
the delivery is making good progress, both retransmission and
new data should be accounted to make PRR more cautious in slow
starting.

Suggested-by: Matt Mathis <mattmathis@google.com>
Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201031013412.1973112-1-ycheng@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'vlan-improvements-for-ocelot-switch'

Vladimir Oltean says:

====================
VLAN improvements for Ocelot switch

The main reason why I started this work is that deleting the bridge mdb
entries fails when the bridge is deleted, as described here:
https://lore.kernel.org/netdev/20201015173355.564934-1-vladimir.oltean@nxp.com/

In short, that happens because the bridge mdb entries are added with a
vid of 1, but deletion is attempted with a vid of 0. So the deletion
code fails to find the mdb entries.

The solution is to make ocelot use a pvid of 0 when it is under a bridge
with vlan_filtering 0. When vlan_filtering is 1, the pvid of the bridge
is what is programmed into the hardware.

The patch series also uncovers more bugs and does some more cleanup, but
the above is the main idea behind it.
====================

Link: https://lore.kernel.org/r/20201031102916.667619-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: improve the workaround for multiple native VLANs on NPI port

After the good discussion with Florian from here:
https://lore.kernel.org/netdev/20200911000337.htwr366ng3nc3a7d@skbuf/

I realized that the VLAN settings on the NPI port (the hardware "CPU port",
in DSA parlance) don't actually make any difference, because that port
is hardcoded in hardware to use what mv88e6xxx would call "unmodified"
egress policy for VLANs.

So earlier patch 183be6f967fe ("net: dsa: felix: send VLANs on CPU port
as egress-tagged") was incorrect in the sense that it didn't actually
make the VLANs be sent on the NPI port as egress-tagged. It only made
ocelot_port_set_native_vlan shut up.

Now that we have moved the check from ocelot_port_set_native_vlan to
ocelot_vlan_prepare, we can simply shunt ocelot_vlan_prepare from DSA,
and avoid calling it. This is the correct way to deal with things,
because the NPI port configuration is DSA-specific, so the ocelot switch
library should not have the check for multiple native VLANs refined in
any way, it is correct the way it is.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: deny changing the native VLAN from the prepare phase

Put the preparation phase of switchdev VLAN objects to some good use,
and move the check we already had, for preventing the existence of more
than one egress-untagged VLAN per port, to the preparation phase of the
addition.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: move the logic to drop 802.1p traffic to the pvid deletion

Currently, the ocelot_port_set_native_vlan() function starts dropping
untagged and prio-tagged traffic when the native VLAN is removed?

What is the native VLAN? It is the only egress-untagged VLAN that ocelot
supports on a port. If the port is a trunk with 100 VLANs, one of those
VLANs can be transmitted as egress-untagged, and that's the native VLAN.

Is it wrong to drop untagged and prio-tagged traffic if there's no
native VLAN? Yes and no.

In this case, which is more typical, it's ok to apply that drop
configuration:
$ bridge vlan add dev swp0 vid 1 pvid untagged <- this is the native VLAN
$ bridge vlan add dev swp0 vid 100
$ bridge vlan add dev swp0 vid 101
$ bridge vlan del dev swp0 vid 1 <- delete the native VLAN
But only because the pvid and the native VLAN have the same ID.

In this case, it isn't:
$ bridge vlan add dev swp0 vid 1 pvid
$ bridge vlan add dev swp0 vid 100 untagged <- this is the native VLAN
$ bridge vlan del dev swp0 vid 101
$ bridge vlan del dev swp0 vid 100 <- delete the native VLAN

It's wrong, because the switch will drop untagged and prio-tagged
traffic now, despite having a valid pvid of 1.

The confusion seems to stem from the fact that the native VLAN is an
egress setting, while the PVID is an ingress setting. It would be
correct to drop untagged and prio-tagged traffic only if there was no
pvid on the port. So let's do just that.

Background:
https://lore.kernel.org/netdev/CA+h21hrRMrLH-RjBGhEJSTZd6_QPRSd3RkVRQF-wNKkrgKcRSA@mail.gmail.com/#t

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: add a "valid" boolean to struct ocelot_vlan

Currently we are checking in some places whether the port has a native
VLAN on egress or not, by comparing the ocelot_port->vid value with zero.

That works, because VID 0 can never be a native VLAN configured by the
bridge, but now we want to make similar checks for the pvid. That won't
work, because there are cases when we do have the pvid set to 0 (not by
the bridge, by ourselves, but still.. it's confusing). And we can't
encode a negative value into an u16, so add a bool to the structure.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: transform the pvid and native vlan values into a structure

This is a mechanical patch only.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: don't reset the pvid to 0 when deleting it

I have no idea why this code is here, but I have 2 hypotheses:

1.
A desperate attempt to keep untagged traffic working when the bridge
deletes the pvid on a port.

There was a fairly okay discussion here:
https://lore.kernel.org/netdev/CA+h21hrRMrLH-RjBGhEJSTZd6_QPRSd3RkVRQF-wNKkrgKcRSA@mail.gmail.com/#t
which established that in vlan_filtering=1 mode, the absence of a pvid
should denote that the ingress port should drop untagged and priority
tagged traffic. While in vlan_filtering=0 mode, nothing should change.

So in vlan_filtering=1 mode, we should simply let things happen, and not
attempt to save the day. And in vlan_filtering=0 mode, the pvid is 0
anyway, no need to do anything.

2.
The driver encodes the native VLAN (ocelot_port->vid) value of 0 as
special, meaning "not valid". There are checks based on that. But there
are no such checks for the ocelot_port->pvid value of 0. In fact, that's
a perfectly valid value, which is used in standalone mode. Maybe there
was some confusion and the author thought that 0 means "invalid" here as
well.

In conclusion, delete the code*.

*in fact we'll add it back later, in a slightly different form, but for
an entirely different reason than the one for which this exists now.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: use the pvid of zero when bridged with vlan_filtering=0

Currently, mscc_ocelot ports configure pvid=0 in standalone mode, and
inherit the pvid from the bridge when one is present.

When the bridge has vlan_filtering=0, the software semantics are that
packets should be received regardless of whether there's a pvid
configured on the ingress port or not. However, ocelot does not observe
those semantics today.

Moreover, changing the PVID is also a problem with vlan_filtering=0.
We are privately remapping the VID of FDB, MDB entries to the port's
PVID when those are VLAN-unaware (i.e. when the VID of these entries
comes to us as 0). But we have no logic of adjusting that remapping when
the user changes the pvid and vlan_filtering is 0. So stale entries
would be left behind, and untagged traffic will stop matching on them.

And even if we were to solve that, there's an even bigger problem. If
swp0 has pvid 1, and swp1 has pvid 2, and both are under a vlan_filtering=0
bridge, they should be able to forward traffic between one another.
However, with ocelot they wouldn't do that.

The simplest way of fixing this is to never configure the pvid based on
what the bridge is asking for, when vlan_filtering is 0. Only if there
was a VLAN that the bridge couldn't mangle, that we could use as pvid....
So, turns out, there's 0 just for that. And for a reason: IEEE
802.1Q-2018, page 247, Table 9-2-Reserved VID values says:

The null VID. Indicates that the tag header contains only
priority information; no VID is present in the frame.
This VID value shall not be configured as a PVID or a member
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
of a VID Set, or configured in any FDB entry, or used in any
Management operation.

So, aren't we doing exactly what 802.1Q says not to? Well, in a way, but
what we're doing here is just driver-level bookkeeping, all for the
better. The fact that we're using a pvid of 0 is not observable behavior
from the outside world: the network stack does not see the classified
VLAN that the switch uses, in vlan_filtering=0 mode. And we're also more
consistent with the standalone mode now.

And now that we use the pvid of 0 in this mode, there's another advantage:
we don't need to perform any VID remapping for FDB and MDB entries either,
we can just use the VID of 0 that the bridge is passing to us.

The only gotcha is that every time we change the vlan_filtering setting,
we need to reapply the pvid (either to 0, or to the value from the bridge).
A small side-effect visible in the patch is that ocelot_port_set_pvid
needs to be moved above ocelot_port_vlan_filtering, so that it can be
called from there without forward-declarations.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ethernet-ti-am65-cpsw-add-multi-port-support-in-mac-only-mode'

Grygorii Strashko says:

====================
net: ethernet: ti: am65-cpsw: add multi port support in mac-only mode

This series adds multi-port support in mac-only mode (multi MAC mode) to TI
AM65x CPSW driver in preparation for enabling support for multi-port devices,
like Main CPSW0 on K3 J721E SoC or future CPSW3g on K3 AM64x SoC.

The multi MAC mode is implemented by configuring every enabled port in "mac-only"
mode (all ingress packets are sent only to the Host port and egress packets
directed to target Ext. Port) and creating separate net_device for
every enabled Ext. port.

This series does not affect on existing CPSW2g one Ext. Port devices and xmit
path changes are done only for multi-port devices by splitting xmit path for
one-port and multi-port devices.

Patches 1-3: Preparation patches to improve K3 CPSW configuration depending on DT
Patches 4-5: Fix VLAN offload for multi MAC mode
Patch 6: Fixes CPTS context lose issue during PM runtime transition
Patch 7: Fixes TX csum offload for multi MAC mode
Patches 8-9: add multi-port support to TI AM65x CPSW
Patch 10: handle deferred probe with new dev_err_probe() API

changes in v3:
- rebased
- added Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
- added Patch 10 which is minor optimization

changes in v2:
- patch 8: xmit path split for one-port and multi-port devices to avoid
performance losses
- patch 9: fixed the case when Port 1 is disabled
- Patch 7: added fix for TX csum offload

v2: https://lore.kernel.org/patchwork/cover/1321608/
v1: https://lore.kernel.org/patchwork/cover/1315766/
====================

Link: https://lore.kernel.org/r/20201030200707.24294-1-grygorii.strashko@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: handle deferred probe with dev_err_probe()

Use new dev_err_probe() API to handle deferred probe properly and simplify
the code.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: add multi port support in mac-only mode

This patch adds final multi-port support to TI AM65x CPSW driver path in
preparation for adding support for multi-port devices, like Main CPSW0 on
K3 J721E SoC or future CPSW3g on K3 AM64x SoC.
- the separate netdev is created for every enabled external Port;
- DMA channels are common/shared for all external Ports and the RX/TX NAPI
and DMA processing assigned to first available netdev;
- external Ports are configured in mac-only mode, which is similar to TI
"dual-mac" mode for legacy TI CPSW - packets are sent to the Host port only
in ingress and directly to the Port on egress. No packet switching between
external ports happens.
- every port supports the same features as current AM65x CPSW on external
device.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: prepare xmit/rx path for multi-port devices in mac-only mode

This patch adds multi-port support to TI AM65x CPSW driver xmit/rx path in
preparation for adding support for multi-port devices, like Main CPSW0 on
K3 J721E SoC or future CPSW3g on K3 AM64x SoC.
Hence DMA channels are common/shared for all ext Ports and the RX/TX NAPI
and DMA processing going to be assigned to first available netdev this patch:
- ensures all RX descriptors fields are initialized;
- adds synchronization for TX DMA push/pop operation (locking) as
Networking core locks are not enough any more;
- updates TX bql processing for every packet in
am65_cpsw_nuss_tx_compl_packets() as every completed TX skb can have
different ndev assigned (come from different netdevs).

To avoid performance issues for existing one-port CPSW2g devices the above
changes are done only for multi-port devices by splitting xmit path for
one-port and multi-port devices.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: fix tx csum offload for multi mac mode

The current implementation uses .ndo_set_features() callback to track
NETIF_F_HW_CSUM feature changes and update generic
CPSW_P0_CONTROL_REG.RX_CHECKSUM_EN option accordingly. It's not going to
work in case of multi-port devices as TX csum offload can be changed per
netdev.

On K3 CPSWxG devices TX csum offload enabled in the following way:

- the CPSW_P0_CONTROL_REG.RX_CHECKSUM_EN option enables TX csum offload in
generic and affects all TX DMA channels and packets;

- corresponding fields in TX DMA descriptor have to be filed properly when
upper layer wants to offload TX csum (skb->ip_summed == CHECKSUM_PARTIAL)
and it's per-packet option.

The Linux Network core is expected to never request TX csum offload if
netdev NETIF_F_HW_CSUM feature is disabled, and, as result, TX DMA
descriptors should not be modified, and per-packet TX csum offload will be
disabled (or enabled) on per-netdev basis. Which, in turn, makes it safe to
enable the CPSW_P0_CONTROL_REG.RX_CHECKSUM_EN option unconditionally.

Hence, fix TX csum offload for multi-port devices by:
- enabling the CPSW_P0_CONTROL_REG.RX_CHECKSUM_EN option in
am65_cpsw_nuss_common_open() unconditionally
- and removing .ndo_set_features() callback implementation, which was used
only NETIF_F_HW_CSUM feature update purposes

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: keep active if cpts enabled

Some K3 CPSW NUSS instances can lose context after PM runtime ON->OFF->ON
transition depending on integration (including all submodules: CPTS, MDIO,
etc), like J721E Main CPSW (CPSW9G).

In case CPTS is enabled it's initialized during probe and does not expect
to be reset. Hence, keep K3 CPSW active by forbidding PM runtime if CPTS is
enabled.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: fix vlan offload for multi mac mode

The VLAN offload for AM65x CPSW2G is implemented using existing ALE APIs,
which are also used by legacy CPSW drivers.
So, now it always adds current Ext. Port and Host as VLAN members when VLAN
is added by 8021Q core (.ndo_vlan_rx_add_vid) and forcibly removes VLAN
from ALE table in .ndo_vlan_rx_kill_vid(). This works as for AM65x CPSW2G
(which has only one Ext. Port) as for legacy CPSW devices (which can't
support same VLAN on more then one Port in multi mac (dual-mac) mode). But
it doesn't work for the new J721E and AM64x multi port CPSWxG versions
doesn't have such restrictions and allow to offload the same VLAN on any
number of ports.

Now the attempt to add same VLAN on two (or more) K3 CPSWxG Ports will
cause:
- VLAN members mask overwrite when VLAN is added
- VLAN removal from ALE table when any Port removes VLAN

This patch fixes an issue by:
- switching to use cpsw_ale_vlan_add_modify() instead of
   cpsw_ale_add_vlan() when VLAN is added to ALE table, so VLAN members
   mask will not be overwritten;
- Updates cpsw_ale_del_vlan() as:
     if more than one ext. Port is in VLAN member mask
     then remove only current port from VLAN member mask
     else remove VLAN ALE entry

Example:
  add: P1 | P0 (Host) -> members mask: P1 | P0
  add: P2 | P0        -> members mask: P2 | P1 | P0
  rem: P1 | P0        -> members mask: P2 | P0
  rem: P2 | P0        -> members mask: -

The VLAN is forcibly removed if port_mask=0 passed to cpsw_ale_del_vlan()
to preserve existing legacy CPSW drivers functionality.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: cpsw_ale: add cpsw_ale_vlan_del_modify()

Add/export cpsw_ale_vlan_del_modify() and use it in cpsw_switchdev instead
of generic cpsw_ale_del_vlan() to avoid mixing 8021Q and switchdev VLAN
offload. This is preparation patch equired by follow up changes.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: use cppi5_desc_is_tdcm()

Use cppi5_desc_is_tdcm() helper for teardown indicator detection instead of
hard-coded value.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: move free desc queue mode selection in pdata

In preparation of adding more multi-port K3 CPSW versions move free
descriptor queue mode selection in am65_cpsw_pdata, so it can be selected
basing on DT compatibility property.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ethernet: ti: am65-cpsw: move ale selection in pdata

In preparation of adding more multi-port K3 CPSW versions move ALE
selection in am65_cpsw_pdata, so it can be selected basing on DT
compatibility property.

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipv6: For kerneldoc warnings with W=1

net/ipv6/addrconf.c:2005: warning: Function parameter or member 'dev' not described in 'ipv6_dev_find'
net/ipv6/ip6_vti.c:138: warning: Function parameter or member 'ip6n' not described in 'vti6_tnl_bucket'
net/ipv6/ip6_tunnel.c:218: warning: Function parameter or member 'ip6n' not described in 'ip6_tnl_bucket'
net/ipv6/ip6_tunnel.c:238: warning: Function parameter or member 'ip6n' not described in 'ip6_tnl_link'
net/ipv6/ip6_tunnel.c:254: warning: Function parameter or member 'ip6n' not described in 'ip6_tnl_unlink'
net/ipv6/ip6_tunnel.c:427: warning: Function parameter or member 'raw' not described in 'ip6_tnl_parse_tlv_enc_lim'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'skb' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'ipproto' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'opt' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'type' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'code' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'msg' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'info' not described in 'ip6_tnl_err'
net/ipv6/ip6_tunnel.c:499: warning: Function parameter or member 'offset' not described in 'ip6_tnl_err'

ip6_tnl_err() is an internal function, so remove the kerneldoc. For
the others, add the missing parameters.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201031183044.1082193-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: driver: hamradio: Fix potential unterminated string

With W=1 the following error is reported:

In function ‘strncpy’,
    inlined from ‘hdlcdrv_ioctl’ at drivers/net/hamradio/hdlcdrv.c:600:4:
./include/linux/string.h:297:30: warning: ‘__builtin_strncpy’ specified bound 32 equals destination size [-Wstringop-truncation]
  297 | #define __underlying_strncpy __builtin_strncpy
      |                              ^
./include/linux/string.h:307:9: note: in expansion of macro ‘__underlying_strncpy’
  307 |  return __underlying_strncpy(p, q, size);

Replace strncpy with strlcpy to guarantee the string is terminated.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201031181700.1081693-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: wan: lmc: Fix W=1 set but used variable warnings

drivers/net/wan/lmc/lmc_main.c: In function ‘lmc_ioctl’:
drivers/net/wan/lmc/lmc_main.c:356:25: warning: variable ‘mii’ set but not used [-Wunused-but-set-variable]
  356 |                     u16 mii;
      |                         ^~~
drivers/net/wan/lmc/lmc_main.c:427:25: warning: variable ‘mii’ set but not used [-Wunused-but-set-variable]
  427 |                     u16 mii;
      |                         ^~~
drivers/net/wan/lmc/lmc_main.c: In function ‘lmc_interrupt’:
drivers/net/wan/lmc/lmc_main.c:1188:9: warning: variable ‘firstcsr’ set but not used [-Wunused-but-set-variable]
1188 |     u32 firstcsr;

This file has funky indentation, and makes little use of tabs. Keep
with this style in the patch, but that makes checkpatch unhappy.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201031181417.1081511-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: xen-netfront: Fixed W=1 set but unused warnings

drivers/net/xen-netfront.c:2416:16: warning: variable ‘target’ set but not used [-Wunused-but-set-variable]
2416 | unsigned long target;

Remove target and just discard the return value from simple_strtoul().

This patch does give a checkpatch warning, but the warning was there
before anyway, as this file has lots of checkpatch warnings.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201031180435.1081127-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'davicom-w-1-fixes'

Andrew Lunn says:

====================
davicom W=1 fixes

Fixup various W=1 warnings, and then add COMPILE_TEST support, which
explains why these where missed on the previous pass.
====================

Link: https://lore.kernel.org/r/20201031005833.1060316-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: davicom Add COMPILE_TEST support

Improve the build testing of this davicom driver by enabling it when
COMPILE_TEST is selected.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: davicom: Fixed unused but set variable with W=1

drivers/net/ethernet/davicom//dm9000.c: In function ‘dm9000_dumpblk_8bit’:
drivers/net/ethernet/davicom//dm9000.c:235:6: warning: variable ‘tmp’ set but not used [-Wunused-but-set-variable]

The driver needs to read packet data from the device even when the
packet is known bad. There is no need to assign the data to a variable
during this discard operation.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

drivers: net: tulip: Fix set but not used with W=1

When compiled for platforms other than __i386__ or __x86_64__:

drivers/net/ethernet/dec/tulip/tulip_core.c: In function ‘tulip_init_one’:
drivers/net/ethernet/dec/tulip/tulip_core.c:1296:13: warning: variable ‘last_irq’ set but not used [-Wunused-but-set-variable]
1296 | static int last_irq;

Add more #if defined() to totally remove the code when not needed.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20201031005445.1060112-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: add test script for bareudp tunnels

Test different encapsulation modes of the bareudp module:
  * Unicast MPLS,
  * IPv4 only,
  * IPv4 in multiproto mode (that is, IPv4 and IPv6),
  * IPv6.

Each mode is tested with both an IPv4 and an IPv6 underlay.

v2:
  * Add build dependencies in config file (Willem de Bruijn).
  * The MPLS test now uses its own IP addresses. This minimises
    the amount of cleanup between tests and simplifies the script.
  * Verify that iproute2 supports bareudp tunnels before running the
    script (and other minor usability improvements).

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Link: https://lore.kernel.org/r/8abc0e58f8a7eeb404f82466505a73110bc43ab8.1604088587.git.gnault@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-ast2400-2500-phy-handle-support'

Ivan Mikhaylov says:

====================
add ast2400/2500 phy-handle support

This patch introduces ast2400/2500 phy-handle support with an embedded
MDIO controller. At the current moment it is not possible to set options
with this format on ast2400/2500:

mac {
phy-handle = <&phy>;
phy-mode = "rgmii";

mdio {
#address-cells = <1>;
#size-cells = <0>;

phy: ethernet-phy@0 {
compatible = "ethernet-phy-idxxxx.yyyy";
reg = <0>;
};
};
};

The patch fixes it and gets possible PHYs and register them with
of_mdiobus_register.

Changes from v3:
   1. add dt-bindings description of MDIO node and phy-handle option
      with example.

Changes from v2:
   1. change manual phy interface type check on phy_interface_mode_is_rgmii
      function.
   2. add err_phy_connect label.
   3. split ftgmac100_destroy_mdio into ftgmac100_phy_disconnect and
      ftgmac100_destroy_mdio.
   4. remove unneeded mdio_np checks.

Changes from v1:
   1. split one patch into two.
====================

Link: https://lore.kernel.org/r/20201030133707.12099-1-i.mikhaylov@yadro.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: ftgmac100: describe phy-handle and MDIO

Add the phy-handle and MDIO description and add the example with
PHY and MDIO nodes.

Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ftgmac100: add handling of mdio/phy nodes for ast2400/2500

phy-handle can't be handled well for ast2400/2500 which has an embedded
MDIO controller. Add ftgmac100_mdio_setup for ast2400/2500 and initialize
PHYs from mdio child node with of_mdiobus_register.

Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ftgmac100: move phy connect out from ftgmac100_setup_mdio

Split MDIO registration and PHY connect into ftgmac100_setup_mdio and
ftgmac100_mii_probe.

Signed-off-by: Ivan Mikhaylov <i.mikhaylov@yadro.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests/net: timestamping: add ptp v2 support

The timestamping tool is supporting now only PTPv1 (IEEE-1588 2002) while
modern HW often supports also/only PTPv2.

Hence timestamping tool is still useful for sanity testing of PTP drivers
HW timestamping capabilities it's reasonable to upstate it to support
PTPv2. This patch adds corresponding support which can be enabled by using
new parameter "PTPV2".

Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Link: https://lore.kernel.org/r/20201029190931.30883-1-grygorii.strashko@ti.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: 9p: Fix kerneldoc warnings of missing parameters etc

net/9p/client.c:420: warning: Function parameter or member 'c' not described in 'p9_client_cb'
net/9p/client.c:420: warning: Function parameter or member 'req' not described in 'p9_client_cb'
net/9p/client.c:420: warning: Function parameter or member 'status' not described in 'p9_client_cb'
net/9p/client.c:568: warning: Function parameter or member 'uidata' not described in 'p9_check_zc_errors'
net/9p/trans_common.c:23: warning: Function parameter or member 'nr_pages' not described in 'p9_release_pages'
net/9p/trans_common.c:23: warning: Function parameter or member 'pages' not described in 'p9_release_pages'
net/9p/trans_fd.c:132: warning: Function parameter or member 'rreq' not described in 'p9_conn'
net/9p/trans_fd.c:132: warning: Function parameter or member 'wreq' not described in 'p9_conn'
net/9p/trans_fd.c:56: warning: Function parameter or member 'privport' not described in 'p9_fd_opts'
net/9p/trans_rdma.c:113: warning: Function parameter or member 'cqe' not described in 'p9_rdma_context'
net/9p/trans_rdma.c:129: warning: Function parameter or member 'privport' not described in 'p9_rdma_opts'
net/9p/trans_virtio.c:215: warning: Function parameter or member 'limit' not described in 'pack_sg_list_p'
net/9p/trans_virtio.c:83: warning: Function parameter or member 'chan_list' not described in 'virtio_chan'
net/9p/trans_virtio.c:83: warning: Function parameter or member 'p9_max_pages' not described in 'virtio_chan'
net/9p/trans_virtio.c:83: warning: Function parameter or member 'ring_bufs_avail' not described in 'virtio_chan'
net/9p/trans_virtio.c:83: warning: Function parameter or member 'tag' not described in 'virtio_chan'
net/9p/trans_virtio.c:83: warning: Function parameter or member 'vc_wq' not described in 'virtio_chan'

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Dominique Martinet <asmadeus@codewreck.org>
Link: https://lore.kernel.org/r/20201031182655.1082065-1-andrew@lunn.ch
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: mcast: fix stub definition of br_multicast_querier_exists

The commit cited below has changed only the functional prototype of
br_multicast_querier_exists, but forgot to do that for the stub
prototype (the one where CONFIG_BRIDGE_IGMP_SNOOPING is disabled).

Fixes: 955062b03fa6 ("net: bridge: mcast: add support for raw L2 multicast groups")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201101000845.190009-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: use pm_runtime_put_sync in rtl_open error path

We can safely runtime-suspend the chip if rtl_open() fails. Therefore
switch the error path to use pm_runtime_put_sync() as well.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/aa093b1e-f295-5700-1cb7-954b54dd8f17@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: remove unneeded memory barrier in rtl_tx

tp->dirty_tx isn't changed outside rtl_tx(). Therefore I see no need
to guarantee a specific order of reading tp->dirty_tx and tp->cur_tx.
Having said that we can remove the memory barrier.
In addition use READ_ONCE() when reading tp->cur_tx because it can
change in parallel to rtl_tx().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/2264563a-fa9e-11b0-2c42-31bc6b8e2790@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ne2k: Fix Typo in RW-Bugfix

Correct a typo in ne.c and ne2k-pci.c which
prevented activation of the RW-Bugfix.

Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Link: https://lore.kernel.org/r/20201029143357.7008-1-W_Armin@gmx.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: macb: add support for high speed interface

This patch adds support for 10GBASE-R interface to the linux driver for
Cadence's ethernet controller.
This controller has separate MAC's and PCS'es for low and high speed paths.
High speed PCS supports 100M, 1G, 2.5G, 5G and 10G through rate adaptation
implementation. However, since it doesn't support auto negotiation, linux
driver is modified to support 10GBASE-R instead of USXGMII.

Signed-off-by: Parshuram Thombare <pthombar@cadence.com>
Link: https://lore.kernel.org/r/1603975627-18338-1-git-send-email-pthombar@cadence.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/smc: improve return codes for SMC-Dv2

To allow better problem diagnosis the return codes for SMC-Dv2 are
improved by this patch. A few more CLC DECLINE codes are defined and
sent to the peer when an SMC connection cannot be established.
There are now multiple SMC variations that are offered by the client and
the server may encounter problems to initialize all of them.
Because only one diagnosis code can be sent to the client the decision
was made to send the first code that was encountered. Because the server
tries the variations in the order of importance (SMC-Dv2, SMC-D, SMC-R)
this makes sure that the diagnosis code of the most important variation
is sent.

v2: initialize rc in smc_listen_v2_check().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Link: https://lore.kernel.org/r/20201031181938.69903-1-kgraul@linux.ibm.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'support-for-octeontx2-98xx-silcion'

Subbaraya Sundeep says:

====================
Support for OcteonTx2 98xx silicon

OcteonTx2 series of silicons have multiple variants, the
98xx variant has two network interface controllers (NIX blocks)
each of which supports upto 100Gbps. Similarly 98xx supports
two crypto blocks (CPT) to double the crypto performance.
The current RVU drivers support a single NIX and
CPT blocks, this patchset adds support for multiple
blocks of same type to be active at the same time.

Also the number of serdes controllers (CGX) have increased
from three to five on 98xx. Each of the CGX block supports
upto 4 physical interfaces depending on the serdes mode ie
upto 20 physical interfaces. At a time each CGX block can
be mapped to a single NIX. The HW configuration to map CGX
and NIX blocks is done by firmware.

NPC has two new interfaces added NIX1_RX and NIX1_TX
similar to NIX0 interfaces. Also MCAM entries is increased
from 4k to 16k. To support the 16k entries extended set
is added in hardware which are at completely different
register offsets. Fortunately new constant registers
can be read to figure out the extended set is present
or not.

This patch set modifies existing AF and PF drivers
in below order to support 98xx:
- Prepare for supporting multiple blocks of same type.
  Functions which operate with block type to get or set
  resources count are modified to operate with block address
- Manage allocating and freeing LFs from new NIX1 and CPT1 RVU blocks.
- NIX block specific initialization and teardown for NIX1
- Based on the mapping set by Firmware, assign the NIX block
  LFs to a PF/VF.
- Multicast entries context is setup for NIX1 along with NIX0
- NPC changes to support extended set of MCAM entries, counters
  and NIX1 interfaces to NPC.
- All the mailbox changes required for the new blocks in 98xx.
- Since there are more CGX links in 98xx the hardcoded LBK
  link value needed by netdev drivers is not sufficient any
  more. Hence AF consumers need to get the number of all links
  and calculate the LBK link.
- Debugfs changes to display NIX1 contexts similar to NIX0
- Debugfs change to display mapping between CGX, NIX and PF.
====================

Link: https://lore.kernel.org/r/1603948549-781-1-git-send-email-sundeep.lkml@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Display CGX, NIX and PF map in debugfs.

Unlike earlier silicon variants, OcteonTx2 98xx
silicon has 2 NIX blocks and each of the CGX is
mapped to either of the NIX blocks. Each NIX
block supports 100G. Mapping btw NIX blocks and
CGX is done by firmware based on CGX speed config
to have a maximum possible network bandwidth.
Since the mapping is not fixed, it's difficult
for a user to figure out. Hence added a debugfs
entry which displays mapping between CGX LMAC,
NIX block and RVU PF.
Sample result of this entry ::

~# cat /sys/kernel/debug/octeontx2/rvu_pf_cgx_map
PCI dev RVU PF Func NIX block CGX LMAC
0002:02:00.0 0x400 NIX0 CGX0 LMAC0

Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Display NIX1 also in debugfs

If NIX1 block is also implemented then add a new
directory for NIX1 in debugfs root. Stats of
NIX1 block can be read/writen from/to the files
in directory "/sys/kernel/debug/octeontx2/nix1/".

Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-pf: Calculate LBK link instead of hardcoding

CGX links are followed by LBK links but number of
CGX and LBK links varies between platforms. Hence
get the number of links present in hardware from
AF and use it to calculate LBK link number.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Mbox changes for 98xx

This patch puts together all mailbox changes
for 98xx silicon:

Attach ->
Modify resource attach mailbox handler to
request LFs from a block address out of multiple
blocks of same type. If a PF/VF need LFs from two
blocks of same type then attach mbox should be
called twice.

Example:
        struct rsrc_attach *attach;
        .. Allocate memory for message ..
        attach->cptlfs = 3; /* 3 LFs from CPT0 */
        .. Send message ..
        .. Allocate memory for message ..
        attach->modify = 1;
        attach->cpt_blkaddr = BLKADDR_CPT1;
        attach->cptlfs = 2; /* 2 LFs from CPT1 */
        .. Send message ..

Detach ->
Update detach mailbox and its handler to detach
resources from CPT1 and NIX1 blocks.

MSIX ->
Updated the MSIX mailbox and its handler to return
MSIX offsets for the new block CPT1.

Free resources ->
Update free_rsrc mailbox and its handler to return
the free resources count of new blocks NIX1 and CPT1

Links ->
Number of CGX,LBK and SDP links may vary between
platforms. For example, in 98xx number of CGX and LBK
links are more than 96xx. Hence the info about number
of links present in hardware is useful for consumers to
request link configuration properly. This patch sends
this info in nix_lf_alloc_rsp.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Add NIX1 interfaces to NPC

On 98xx silicon, NPC block has additional
mcam entries, counters and NIX1 interfaces.
Extended set of registers are present for the
new mcam entries and counters.
This patch does the following:
- updates the register accessing macros
to use extended set if present.
- configures the MKEX profile for NIX1 interfaces also.
- updates mcam entry write functions to use assigned
NIX0/1 interfaces for the PF/VF.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Setup MCE context for assigned NIX

Initialize MCE context for the assigned NIX0/1
block for a CGX mapped PF. Modified rvu_nix_aq_enq_inst
function to work with nix_hw so that MCE contexts
for both NIX blocks can be inited.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Map NIX block from CGX connection

Firmware configures NIX block mapping for all CGXs
to achieve maximum throughput. This patch reads
the configuration and create mapping between RVU
PF and NIX blocks. And for LBK VFs assign NIX0 for
even numbered VFs and NIX1 for odd numbered VFs.

Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Initialize NIX1 block

This patch modifies NIX functions to operate
with nix_hw context so that existing functions
can be used for both NIX0 and NIX1 blocks. And
the NIX blocks present in the system are initialized
during driver init and freed during exit.

Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Manage new blocks in 98xx

AF manages the tasks of allocating, freeing
LFs from RVU blocks to PF and VFs. With new
NIX1 and CPT1 blocks in 98xx, this patch
adds support for handling new blocks too.

Co-developed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Update get/set resource count functions

Since multiple blocks of same type are present in
98xx, modify functions which get resource count and
which update resource count to work with individual
block address instead of block type.

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Sunil Goutham <sgoutham@marvell.com>
Signed-off-by: Rakesh Babu <rsaladi2@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: axienet: Properly handle PCS/PMA PHY for 1000BaseX mode

Update the axienet driver to properly support the Xilinx PCS/PMA PHY
component which is used for 1000BaseX and SGMII modes, including
properly configuring the auto-negotiation mode of the PHY and reading
the negotiated state from the PHY.

Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Reviewed-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Link: https://lore.kernel.org/r/20201028171429.1699922-1-robert.hancock@calian.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: avoid a bogus warning

The previous commit added support for IPA having up to six source
and destination resources.  But currently nothing uses more than
four.  (Five of each are used in a newer version of the hardware.)

I find that in one of my build environments the compiler complains
about newly-added code in two spots.  Inspection shows that the
warnings have no merit, but this compiler does not recognize that.

    ipa_main.c:457:39: warning: array index 5 is past the end of the
        array (which contains 4 elements) [-Warray-bounds]
    (and the same warning at line 483)

We can make this warning go away by changing the number of elements
in the source and destination resource limit arrays--now rather than
waiting until we need it to support the newer hardware.  This change
was coming soon anyway; make it now to get rid of the warning.

Signed-off-by: Alex Elder <elder@linaro.org>
Link: https://lore.kernel.org/r/20201031151524.32132-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-add-functionality-to-net-core-byte-packet-counters-and-use-it-in-r8169'

Heiner Kallweit says:

====================
net: add functionality to net core byte/packet counters and use it in r8169

This series adds missing functionality to the net core handling of
byte/packet counters and statistics. The extensions are then used
to remove private rx/tx byte/packet counters in r8169 driver.
====================

Link: https://lore.kernel.org/r/1fdb8ecd-be0a-755d-1d92-c62ed8399e77@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: remove no longer needed private rx/tx packet/byte counters

After switching to the net core rx/tx byte/packet counters we can
remove the now unused private version.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

r8169: use struct pcpu_sw_netstats for rx/tx packet/byte counters

Switch to the net core rx/tx byte/packet counter infrastructure.
This simplifies the code, only small drawback is some memory overhead
because we use just one queue, but allocate the counters per cpu.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: core: add devm_netdev_alloc_pcpu_stats

We have netdev_alloc_pcpu_stats(), and we have devm_alloc_percpu().
Add a managed version of netdev_alloc_pcpu_stats, e.g. for allocating
the per-cpu stats in the probe() callback of a driver. It needs to be
a macro for dealing properly with the type argument.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: core: add dev_sw_netstats_tx_add

Add dev_sw_netstats_tx_add(), complementing already existing
dev_sw_netstats_rx_add(). Other than dev_sw_netstats_rx_add allow to
pass the number of packets as function argument.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'in_interrupt-cleanup-part-2'

Sebastian Andrzej Siewior says:

====================
in_interrupt() cleanup, part 2

in the discussion about preempt count consistency across kernel configurations:

https://lore.kernel.org/r/20200914204209.256266093@linutronix.de/

Linus clearly requested that code in drivers and libraries which changes
behaviour based on execution context should either be split up so that
e.g. task context invocations and BH invocations have different interfaces
or if that's not possible the context information has to be provided by the
caller which knows in which context it is executing.

This includes conditional locking, allocation mode (GFP_*) decisions and
avoidance of code paths which might sleep.

In the long run, usage of 'preemptible, in_*irq etc.' should be banned from
driver code completely.

This is part two addressing remaining drivers except for orinoco-usb.
====================

Cherry picking only Ethernet changes.

Link: https://lore.kernel.org/r/20201027225454.3492351-1-bigeasy@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: tlan: Replace in_irq() usage

The driver uses in_irq() to determine if the tlan_priv::lock has to be
acquired in tlan_mii_read_reg() and tlan_mii_write_reg().

The interrupt handler acquires the lock outside of these functions so the
in_irq() check is meant to prevent a lock recursion deadlock. But this
check is incorrect when interrupt force threading is enabled because then
the handler runs in thread context and in_irq() correctly returns false.

The usage of in_*() in drivers is phased out and Linus clearly requested
that code which changes behaviour depending on context should either be
seperated or the context be conveyed in an argument passed by the caller,
which usually knows the context.

tlan_set_timer() has this conditional as well, but this function is only
invoked from task context or the timer callback itself. So it always has to
lock and the check can be removed.

tlan_mii_read_reg(), tlan_mii_write_reg() and tlan_phy_print() are invoked
from interrupt and other contexts.

Split out the actual function body into helper variants which are called
from interrupt context and make the original functions wrappers which
acquire tlan_priv::lock unconditionally.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Samuel Chessman <chessman@tux.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: forcedeth: Replace context and lock check with a lockdep_assert()

nv_update_stats() triggers a WARN_ON() when invoked from hard interrupt
context because the locks in use are not hard interrupt safe. It also has
an assert_spin_locked() which was the lock check before the lockdep era.

Lockdep has way broader locking correctness checks and covers both issues,
so replace the warning and the lock assert with lockdep_assert_held().

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Rain River <rain.1986.08.12@gmail.com>
Cc: Zhu Yanjun <zyjzyj2000@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: neterion: s2io: Replace in_interrupt() for context detection

wait_for_cmd_complete() uses in_interrupt() to detect whether it is safe to
sleep or not.

The usage of in_interrupt() in drivers is phased out and Linus clearly
requested that code which changes behaviour depending on context should
either be seperated or the context be conveyed in an argument passed by the
caller, which usually knows the context.

in_interrupt() also is only partially correct because it fails to chose the
correct code path when just preemption or interrupts are disabled.

Add an argument 'may_block' to both functions and adjust the callers to
pass the context information.

The following call chains which end up invoking wait_for_cmd_complete()
were analyzed to be safe to sleep:

s2io_card_up()
   s2io_set_multicast()

init_nic()
   init_tti()

s2io_close()
   do_s2io_delete_unicast_mc()
     do_s2io_add_mac()

s2io_set_mac_addr()
   do_s2io_prog_unicast()
     do_s2io_add_mac()

s2io_reset()
   do_s2io_restore_unicast_mc()
     do_s2io_add_mc()
       do_s2io_add_mac()

s2io_open()
   do_s2io_prog_unicast()
     do_s2io_add_mac()

The following call chains which end up invoking wait_for_cmd_complete()
were analyzed to be safe to sleep:

__dev_set_rx_mode()
    s2io_set_multicast()

s2io_txpic_intr_handle()
   s2io_link()
     init_tti()

Add a may_sleep argument to wait_for_cmd_complete(), s2io_set_multicast()
and init_tti() and hand the context information in from the call sites.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Jon Mason <jdmason@kudzu.us>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'l2-multicast-forwarding-for-ocelot-switch'

Vladimir Oltean says:

====================
L2 multicast forwarding for Ocelot switch

This series enables the mscc_ocelot switch to forward raw L2 (non-IP)
mdb entries as configured by the bridge driver after this patch:

https://patchwork.ozlabs.org/project/netdev/patch/20201028233831.610076-1-vladimir.oltean@nxp.com/
====================

Link: https://lore.kernel.org/r/20201029022738.722794-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: support L2 multicast entries

There is one main difference in mscc_ocelot between IP multicast and L2
multicast. With IP multicast, destination ports are encoded into the
upper bytes of the multicast MAC address. Example: to deliver the
address 01:00:5E:11:22:33 to ports 3, 8, and 9, one would need to
program the address of 00:03:08:11:22:33 into hardware. Whereas for L2
multicast, the MAC table entry points to a Port Group ID (PGID), and
that PGID contains the port mask that the packet will be forwarded to.
As to why it is this way, no clue. My guess is that not all port
combinations can be supported simultaneously with the limited number of
PGIDs, and this was somehow an issue for IP multicast but not for L2
multicast. Anyway.

Prior to this change, the raw L2 multicast code was bogus, due to the
fact that there wasn't really any way to test it using the bridge code.
There were 2 issues:
- A multicast PGID was allocated for each MDB entry, but it wasn't in
  fact programmed to hardware. It was dummy.
- In fact we don't want to reserve a multicast PGID for every single MDB
  entry. That would be odd because we can only have ~60 PGIDs, but
  thousands of MDB entries. So instead, we want to reserve a multicast
  PGID for every single port combination for multicast traffic. And
  since we can have 2 (or more) MDB entries delivered to the same port
  group (and therefore PGID), we need to reference-count the PGIDs.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: make entry_type a member of struct ocelot_multicast

This saves a re-classification of the MDB address on deletion.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: remove the "new" variable in ocelot_port_mdb_add

It is Not Needed, a comment will suffice.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: use ether_addr_copy

Since a helper is available for copying Ethernet addresses, let's use it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: classify L2 mdb entries as LOCKED

ocelot.h says:

/* MAC table entry types.
* ENTRYTYPE_NORMAL is subject to aging.
* ENTRYTYPE_LOCKED is not subject to aging.
* ENTRYTYPE_MACv4 is not subject to aging. For IPv4 multicast.
* ENTRYTYPE_MACv6 is not subject to aging. For IPv6 multicast.
*/

We don't want the permanent entries added with 'bridge mdb' to be
subject to aging.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: explicitly convert between mdb entry state and port group flags

When creating a new multicast port group, there is implicit conversion
between the __u8 state member of struct br_mdb_entry and the unsigned
char flags member of struct net_bridge_port_group. This implicit
conversion relies on the fact that MDB_PERMANENT is equal to
MDB_PG_FLAGS_PERMANENT.

Let's be more explicit and convert the state to flags manually.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201028234815.613226-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: mcast: add support for raw L2 multicast groups

Extend the bridge multicast control and data path to configure routes
for L2 (non-IP) multicast groups.

The uapi struct br_mdb_entry union u is extended with another variant,
mac_addr, which does not change the structure size, and which is valid
when the proto field is zero.

To be compatible with the forwarding code that is already in place,
which acts as an IGMP/MLD snooping bridge with querier capabilities, we
need to declare that for L2 MDB entries (for which there exists no such
thing as IGMP/MLD snooping/querying), that there is always a querier.
Otherwise, these entries would be flooded to all bridge ports and not
just to those that are members of the L2 multicast group.

Needless to say, only permanent L2 multicast groups can be installed on
a bridge port.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20201028233831.610076-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'sfc-ef100-tso-enhancements'

Edward Cree says:

====================
sfc: EF100 TSO enhancements

Support TSO over encapsulation (with GSO_PARTIAL), and over VLANs
(which the code already handled but we didn't advertise). Also
correct our handling of IPID mangling.

I couldn't find documentation of exactly what shaped SKBs we can
get given, so patch #2 is slightly guesswork, but when I tested
TSO over both underlay and (VxLAN) overlay, the checksums came
out correctly, so at least in those cases the edits we're making
must be the right ones.
Similarly, I'm not 100% sure I've correctly understood how FIXEDID
and MANGLEID are supposed to work in patch #3.
====================

Link: https://lore.kernel.org/r/6e1ea05f-faeb-18df-91ef-572445691d89@solarflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: advertise our vlan features

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: only use fixed-id if the skb asks for it

AIUI, the NETIF_F_TSO_MANGLEID flag is a signal to the stack that a
driver may _need_ to mangle IDs in order to do TSO, and conversely
a signal from the stack that the driver is permitted to do so.
Since we support both fixed and incrementing IPIDs, we should rely
on the SKB_GSO_FIXEDID flag on a per-skb basis, rather than using
the MANGLEID feature to make all TSOs fixed-id.
Includes other minor cleanups of ef100_make_tso_desc() coding style.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: implement encap TSO on EF100

The NIC only needs to know where the headers it has to edit (TCP and
inner and outer IPv4) are, which fits GSO_PARTIAL nicely.
It also supports non-PARTIAL offload of UDP tunnels, again just
needing to be told the outer transport offset so that it can edit
the UDP length field.
(It's not clear to me whether the stack will ever use the non-PARTIAL
version with the netdev feature flags we're setting here.)

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: extend bitfield macros to 17 fields

We need EFX_POPULATE_OWORD_17 for an encap TSO descriptor on EF100.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-minor-bug-fixes'

Alex Elder says:

====================
net: ipa: minor bug fixes

This series fixes several bugs.  They are minor, in that the code
currently works on supported platforms even without these patches
applied, but they're bugs nevertheless and should be fixed.

Version 2 improves the commit message for the fourth patch.  It also
fixes a bug in two spots in the last patch.  Both of these changes
were suggested by Willem de Bruijn.
====================

Link: https://lore.kernel.org/r/20201028194148.6659-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: avoid going past end of resource group array

The minimum and maximum limits for resources assigned to a given
resource group are programmed in pairs, with the limits for two
groups set in a single register.

If the number of supported resource groups is odd, only half of the
register that defines these limits is valid for the last group; that
group has no second group in the pair.

Currently we ignore this constraint, and it turns out to be harmless,
but it is not guaranteed to be.  This patch addresses that, and adds
support for programming the 5th resource group's limits.

Rework how the resource group limit registers are programmed by
having a single function program all group pairs rather than having
one function program each pair.  Add the programming of the 4-5
resource group pair limits to this function.  If a resource group is
not supported, pass a null pointer to ipa_resource_config_common()
for that group and have that function write zeroes in that case.

Tested-by: Sujit Kautkar <sujitka@chromium.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: distinguish between resource group types

The number of resource groups supported by the hardware can be
different for source and destination resources.  Determine the
number supported for each using separate functions.  Make the
functions inline end move their definitions into "ipa_reg.h",
because they determine whether certain register definitions are
valid.  Pass just the IPA hardware version as argument.

IPA_RESOURCE_GROUP_COUNT represents the maximum number of resource
groups the driver supports for any hardware version.  Change that
symbol to be two separate constants, one for source and the other
for destination resource groups.  Rename them to end with "_MAX"
rather than "_COUNT", to reflect their true purpose.

Tested-by: Sujit Kautkar <sujitka@chromium.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: assign endpoint to a resource group

The IPA hardware manages various resources (e.g. descriptors)
internally to perform its functions. The resources are grouped,
allowing different endpoints to use separate resource pools. This
way one group of endpoints can be configured to operate unaffected
by the resource use of endpoints in a different group.

Endpoints should be assigned to a resource group, but we currently
don't do that.

Define a new resource_group field in the endpoint configuration
data, and use it to assign the proper resource group to use for
each AP endpoint.

Tested-by: Sujit Kautkar <sujitka@chromium.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: fix resource group field mask definition

The mask for the RSRC_GRP field in the INIT_RSRC_GRP endpoint
initialization register is incorrectly defined for IPA v4.2 (where
it is only one bit wide).  So we need to fix this.

The fix is not straightforward, however.  Field masks are passed to
functions like u32_encode_bits(), and for that they must be constant.

To address this, we define a new inline function that returns the
*encoded* value to use for a given RSRC_GRP field, which depends on
the IPA version.  The caller can then use something like this, to
assign a given endpoint resource id 1:

    u32 offset = IPA_REG_ENDP_INIT_RSRC_GRP_N_OFFSET(endpoint_id);
    u32 val = rsrc_grp_encoded(ipa->version, 1);

    iowrite32(val, ipa->reg_virt + offset);

The next patch requires this fix.

Tested-by: Sujit Kautkar <sujitka@chromium.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: assign proper packet context base

At the end of ipa_mem_setup() we write the local packet processing
context base register to tell it where the processing context memory
is. But we are writing the wrong value.

The value written turns out to be the offset of the modem header
memory region (assigned earlier in the function). Fix this bug.

Tested-by: Sujit Kautkar <sujitka@chromium.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dec: tulip: de2104x: Add shutdown handler to stop NIC

The driver does not implement a shutdown handler which leads to issues
when using kexec in certain scenarios. The NIC keeps on fetching
descriptors which gets flagged by the IOMMU with errors like this:

DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr fffff000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr fffff000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr fffff000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr fffff000
DMAR: DMAR:[DMA read] Request device [5e:00.0]fault addr fffff000

Signed-off-by: Moritz Fischer <mdf@kernel.org>
Link: https://lore.kernel.org/r/20201028172125.496942-1-mdf@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: marvell: add special handling of Finisar modules with 88E1111

The Finisar FCLF8520P2BTL 1000BaseT SFP module uses a Marvel 88E1111 PHY
with a modified PHY ID. Add support for this ID using the 88E1111
methods.

By default these modules do not have 1000BaseX auto-negotiation enabled,
which is not generally desirable with Linux networking drivers. Add
handling to enable 1000BaseX auto-negotiation when these modules are
used in 1000BaseX mode. Also, some special handling is required to ensure
that 1000BaseT auto-negotiation is enabled properly when desired.

Based on existing handling in the AMD xgbe driver and the information in
the Finisar FAQ:
https://www.finisar.com/sites/default/files/resources/an-2036_1000base-t_sfp_faqreve1.pdf

Signed-off-by: Robert Hancock <robert.hancock@calian.com>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Link: https://lore.kernel.org/r/20201028171540.1700032-1-robert.hancock@calian.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'sctp-implement-rfc6951-udp-encapsulation-of-sctp'

Xin Long says:

====================
sctp: Implement RFC6951: UDP Encapsulation of SCTP

Description From the RFC:

   The Main Reasons:

   o  To allow SCTP traffic to pass through legacy NATs, which do not
      provide native SCTP support as specified in [BEHAVE] and
      [NATSUPP].

   o  To allow SCTP to be implemented on hosts that do not provide
      direct access to the IP layer.  In particular, applications can
      use their own SCTP implementation if the operating system does not
      provide one.

   Implementation Notes:

   UDP-encapsulated SCTP is normally communicated between SCTP stacks
   using the IANA-assigned UDP port number 9899 (sctp-tunneling) on both
   ends.  There are circumstances where other ports may be used on
   either end, and it might be required to use ports other than the
   registered port.

   Each SCTP stack uses a single local UDP encapsulation port number as
   the destination port for all its incoming SCTP packets, this greatly
   simplifies implementation design.

   An SCTP implementation supporting UDP encapsulation MUST maintain a
   remote UDP encapsulation port number per destination address for each
   SCTP association.  Again, because the remote stack may be using ports
   other than the well-known port, each port may be different from each
   stack.  However, because of remapping of ports by NATs, the remote
   ports associated with different remote IP addresses may not be
   identical, even if they are associated with the same stack.

   Because the well-known port might not be used, implementations need
   to allow other port numbers to be specified as a local or remote UDP
   encapsulation port number through APIs.

Patches:

   This patchset is using the udp4/6 tunnel APIs to implement the UDP
   Encapsulation of SCTP with not much change in SCTP protocol stack
   and with all current SCTP features keeped in Linux Kernel.

   1 - 4: Fix some UDP issues that may be triggered by SCTP over UDP.
   5 - 7: Process incoming UDP encapsulated packets and ICMP packets.
   8 -10: Remote encap port's update by sysctl, sockopt and packets.
   11-14: Process outgoing pakects with UDP encapsulated and its GSO.
   15-16: Add the part from draft-tuexen-tsvwg-sctp-udp-encaps-cons-03.
      17: Enable this feature.

Tests:

  - lksctp-tools/src/func_tests with UDP Encapsulation enabled/disabled:

      Both make v4test and v6test passed.

  - sctp-tests with UDP Encapsulation enabled/disabled:

      repeatability/procdumps/sctpdiag/gsomtuchange/extoverflow/
      sctphashtable passed. Others failed as expected due to those
      "iptables -p sctp" rules.

  - netperf on lo/netns/virtio_net, with gso enabled/disabled and
    with ip_checksum enabled/disabled, with UDP Encapsulation
    enabled/disabled:

      No clear performance dropped.

v1->v2:
  - Fix some incorrect code in the patches 5,6,8,10,11,13,14,17, suggested
    by Marcelo.
  - Append two patches 15-16 to add the Additional Considerations for UDP
    Encapsulation of SCTP from draft-tuexen-tsvwg-sctp-udp-encaps-cons-03.
v2->v3:
  - remove the cleanup code in patch 2, suggested by Willem.
  - remove the patch 3 and fix the checksum in the new patch 3 after
    talking with Paolo, Marcelo and Guillaume.
  - add 'select NET_UDP_TUNNEL' in patch 4 to solve a compiling error.
  - fix __be16 type cast warning in patch 8.
  - fix the wrong endian orders when setting values in 14,16.
v3->v4:
  - add entries in ip-sysctl.rst in patch 7,16, as Marcelo Suggested.
  - not create udp socks when udp_port is set to 0 in patch 16, as
    Marcelo noticed.
v4->v5:
  - improve the description for udp_port and encap_port entries in patch
    7, 16.
  - use 0 as the default udp_port.
====================

Link: https://lore.kernel.org/r/cover.1603955040.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: enable udp tunneling socks

This patch is to enable udp tunneling socks by calling
sctp_udp_sock_start() in sctp_ctrlsock_init(), and
sctp_udp_sock_stop() in sctp_ctrlsock_exit().

Also add sysctl udp_port to allow changing the listening
sock's port by users.

Wit this patch, the whole sctp over udp feature can be
enabled and used.

v1->v2:
  - Also update ctl_sock udp_port in proc_sctp_do_udp_port()
    where netns udp_port gets changed.
v2->v3:
  - Call htons() when setting sk udp_port from netns udp_port.
v3->v4:
  - Not call sctp_udp_sock_start() when new_value is 0.
  - Add udp_port entry in ip-sysctl.rst.
v4->v5:
  - Not call sctp_udp_sock_start/stop() in sctp_ctrlsock_init/exit().
  - Improve the description of udp_port in ip-sysctl.rst.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: handle the init chunk matching an existing asoc

This is from Section 4 of draft-tuexen-tsvwg-sctp-udp-encaps-cons-03,
and it requires responding with an abort chunk with an error cause
when the udp source port of the received init chunk doesn't match the
encap port of the transport.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: add the error cause for new encapsulation port restart

This patch is to add the function to make the abort chunk with
the error cause for new encapsulation port restart, defined
on Section 4.4 in draft-tuexen-tsvwg-sctp-udp-encaps-cons-03.

v1->v2:
- no change.
v2->v3:
- no need to call htons() when setting nep.cur_port/new_port.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: support for sending packet over udp6 sock

This one basically does the similar things in sctp_v6_xmit as does for
udp4 sock in the last patch, just note that:

  1. label needs to be calculated, as it's the param of
     udp_tunnel6_xmit_skb().

  2. The 'nocheck' param of udp_tunnel6_xmit_skb() is false, as
     required by RFC.

v1->v2:
  - Use sp->udp_port instead in sctp_v6_xmit(), which is more safe.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: support for sending packet over udp4 sock

This patch does what the rfc6951#section-5.3 says for ipv4:

  "Within the UDP header, the source port MUST be the local UDP
   encapsulation port number of the SCTP stack, and the destination port
   MUST be the remote UDP encapsulation port number maintained for the
   association and the destination address to which the packet is sent
   (see Section 5.1).

   Because the SCTP packet is the UDP payload, the length of the UDP
   packet MUST be the length of the SCTP packet plus the size of the UDP
   header.

   The SCTP checksum MUST be computed for IPv4 and IPv6, and the UDP
   checksum SHOULD be computed for IPv4 and IPv6."

Some places need to be adjusted in sctp_packet_transmit():

  1. For non-gso packets, when transport's encap_port is set, sctp
     checksum has to be done in sctp_packet_pack(), as the outer
     udp will use ip_summed = CHECKSUM_PARTIAL to do the offload
     setting for checksum.

  2. Delay calling dst_clone() and skb_dst_set() for non-udp packets
     until sctp_v4_xmit(), as for udp packets, skb_dst_set() is not
     needed before calling udp_tunnel_xmit_skb().

then in sctp_v4_xmit():

  1. Go to udp_tunnel_xmit_skb() only when transport->encap_port and
     net->sctp.udp_port both are set, as these are one for dst port
     and another for src port.

  2. For gso packet, SKB_GSO_UDP_TUNNEL_CSUM is set for gso_type, and
     with this udp checksum can be done in __skb_udp_tunnel_segment()
     for each segments after the sctp gso.

  3. inner_mac_header and inner_transport_header are set, as these
     will be needed in __skb_udp_tunnel_segment() to find the right
     headers.

  4. df and ttl are calculated, as these are the required params by
     udp_tunnel_xmit_skb().

  5. nocheck param has to be false, as "the UDP checksum SHOULD be
     computed for IPv4 and IPv6", says in rfc6951#section-5.3.

v1->v2:
  - Use sp->udp_port instead in sctp_v4_xmit(), which is more safe.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: call sk_setup_caps in sctp_packet_transmit instead

sk_setup_caps() was originally called in Commit 90017accff61 ("sctp:
Add GSO support"), as:

"We have to refresh this in case we are xmiting to more than one
transport at a time"

This actually happens in the loop of sctp_outq_flush_transports(),
and it shouldn't be tied to gso, so move it out of gso part and
before sctp_packet_pack().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: add udphdr to overhead when udp_port is set

sctp_mtu_payload() is for calculating the frag size before making
chunks from a msg. So we should only add udphdr size to overhead
when udp socks are listening, as only then sctp can handle the
incoming sctp over udp packets and outgoing sctp over udp packets
will be possible.

Note that we can't do this according to transport->encap_port, as
different transports may be set to different values, while the
chunks were made before choosing the transport, we could not be
able to meet all rfc6951#section-5.6 recommends.

v1->v2:
- Add udp_port for sctp_sock to avoid a potential race issue, it
will be used in xmit path in the next patch.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: allow changing transport encap_port by peer packets

As rfc6951#section-5.4 says:

  "After finding the SCTP association (which
   includes checking the verification tag), the UDP source port MUST be
   stored as the encapsulation port for the destination address the SCTP
   packet is received from (see Section 5.1).

   When a non-encapsulated SCTP packet is received by the SCTP stack,
   the encapsulation of outgoing packets belonging to the same
   association and the corresponding destination address MUST be
   disabled."

transport encap_port should be updated by a validated incoming packet's
udp src port.

We save the udp src port in sctp_input_cb->encap_port, and then update
the transport in two places:

  1. right after vtag is verified, which is required by RFC, and this
     allows the existent transports to be updated by the chunks that
     can only be processed on an asoc.

  2. right before processing the 'init' where the transports are added,
     and this allows building a sctp over udp connection by client with
     the server not knowing the remote encap port.

  3. when processing ootb_pkt and creating the temporary transport for
     the reply pkt.

Note that sctp_input_cb->header is removed, as it's not used any more
in sctp.

v1->v2:
  - Change encap_port as __be16 for sctp_input_cb.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: add SCTP_REMOTE_UDP_ENCAPS_PORT sockopt

This patch is to implement:

  rfc6951#section-6.1: Get or Set the Remote UDP Encapsulation Port Number

with the param of the struct:

  struct sctp_udpencaps {
    sctp_assoc_t sue_assoc_id;
    struct sockaddr_storage sue_address;
    uint16_t sue_port;
  };

the encap_port of sock, assoc or transport can be changed by users,
which also means it allows the different transports of the same asoc
to have different encap_port value.

v1->v2:
  - no change.
v2->v3:
  - fix the endian warning when setting values between encap_port and
    sue_port.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: add encap_port for netns sock asoc and transport

encap_port is added as per netns/sock/assoc/transport, and the
latter one's encap_port inherits the former one's by default.
The transport's encap_port value would mostly decide if one
packet should go out with udp encapsulated or not.

This patch also allows users to set netns' encap_port by sysctl.

v1->v2:
  - Change to define encap_port as __be16 for sctp_sock, asoc and
    transport.
v2->v3:
  - No change.
v3->v4:
  - Add 'encap_port' entry in ip-sysctl.rst.
v4->v5:
  - Improve the description of encap_port in ip-sysctl.rst.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sctp: add encap_err_lookup for udp encap socks

As it says in rfc6951#section-5.5:

  "When receiving ICMP or ICMPv6 response packets, there might not be
   enough bytes in the payload to identify the SCTP association that the
   SCTP packet triggering the ICMP or ICMPv6 packet belongs to.  If a
   received ICMP or ICMPv6 packet cannot be related to a specific SCTP
   association or the verification tag cannot be verified, it MUST be
   discarded silently.  In particular, this means that the SCTP stack
   MUST NOT rely on receiving ICMP or ICMPv6 messages.  Implementation
   constraints could prevent processing received ICMP or ICMPv6
   messages."

ICMP or ICMPv6 packets need to be handled, and this is implemented by
udp encap sock .encap_err_lookup function.

The .encap_err_lookup function is called in __udp(6)_lib_err_encap()
to confirm this path does need to be updated. For sctp, what we can
do here is check if the corresponding asoc and transport exist.

Note that icmp packet process for sctp over udp is done by udp sock
.encap_err_lookup(), and it means for now we can't do as much as
sctp_v4/6_err() does. Also we can't do the two mappings mentioned
in rfc6951#section-5.5.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>