git.baikalelectronics.ru Git

netfilter: nftables: add nft_parse_register_load() and use it

This new function combines the netlink register attribute parser
and the load validation function.

This update requires to replace:

enum nft_registers sreg:8;

in many of the expression private areas otherwise compiler complains
with:

error: cannot take address of bit-field ‘sreg’

when passing the register field as reference.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: flowtable: add hash offset field to tuple

Add a placeholder field to calculate hash tuple offset. Similar to
89c365785ff0 ("netfilter: conntrack: avoid gcc-10 zero-length-bounds
warning").

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

ipvs: add weighted random twos choice algorithm

Adds the random twos choice load-balancing algorithm. The algorithm will
pick two random servers based on weights. Then select the server with
the least amount of connections normalized by weight. The algorithm
avoids the "herd behavior" problem. The algorithm comes from a paper
by Michael Mitzenmacher available here
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/twosurvey.pdf

Signed-off-by: Darby Payne <darby.payne@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: remove get_ct indirection

Use nf_ct_get() directly, its a small inline helper without dependencies.

Add CONFIG_NF_CONNTRACK guards to elide the relevant part when conntrack
isn't available at all.

v2: add ifdef guard around nf_ct_get call (kernel test robot)
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge branch 'net-dsa-hellcreek-add-taprio-offloading'

Kurt Kanzenbach says:

====================
net: dsa: hellcreek: Add TAPRIO offloading

The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic
schedules may be configured individually on each front port. Each port
has eight egress queues. The traffic is mapped to a traffic class
respectively via the PCP field of a VLAN tagged frame.

Previous attempts:
* https://lkml.kernel.org/netdev/20201121115703.23221-1-kurt@linutronix.de/
* https://lkml.kernel.org/netdev/20210116124922.32356-1-kurt@linutronix.de/
====================

Link: https://lore.kernel.org/r/20210123105633.16753-1-kurt@linutronix.de
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: hellcreek: Add TAPRIO offloading support

The switch has support for the 802.1Qbv Time Aware Shaper (TAS). Traffic
schedules may be configured individually on each front port. Each port has eight
egress queues. The traffic is mapped to a traffic class respectively via the PCP
field of a VLAN tagged frame.

The TAPRIO Qdisc already implements that. Therefore, this interface can simply
be reused. Add .port_setup_tc() accordingly.

The activation of a schedule on a port is split into two parts:

* Programming the necessary gate control list (GCL)
* Setup delayed work for starting the schedule

The hardware supports starting a schedule up to eight seconds in the future. The
TAPRIO interface provides an absolute base time. Therefore, periodic delayed
work is leveraged to check whether a schedule may be started or not.

Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mhi: Set wwan device type

The 'wwan' devtype is meant for devices that require additional
configuration to be used, like WWAN specific APN setup over AT/QMI
commands, rmnet link creation, etc. This is the case for MHI (Modem
host Interface) netdev which targets modem/WWAN endpoints.

Signed-off-by: Loic Poulain <loic.poulain@linaro.org>
Link: https://lore.kernel.org/r/1611328554-1414-1-git-send-email-loic.poulain@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'udp-allow-forwarding-of-plain-non-fraglisted-udp-gro-packets'

Alexander Lobakin says:

====================
udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

This series allows to form UDP GRO packets in cases without sockets
(for forwarding). To not change the current datapath, this is
performed only when the new corresponding netdev feature is enabled
via Ethtool (and fraglisted GRO is disabled).
Prior to this point, only fraglisted UDP GRO was available. Plain UDP
GRO shows better forwarding performance when a target NIC is capable
of GSO UDP offload.

Since v3 [2]:
- rename introduced netdev feature to reflect that it targets
   forwarding and don't touch fraglisted GRO at all (Willem de Bruijn).

Since v2 [1]:
- convert to a series;
- new: add new netdev_feature to explicitly enable/disable UDP GRO
   when there is no socket, defaults to off (Paolo Abeni).

Since v1 [0]:
- drop redundant 'if (sk)' check (Alexander Duyck);
- add a ref in the commit message to one more commit that was
   an important step for UDP GRO forwarding.

[0] https://lore.kernel.org/netdev/20210112211536.261172-1-alobakin@pm.me
[1] https://lore.kernel.org/netdev/20210113103232.4761-1-alobakin@pm.me
[2] https://lore.kernel.org/netdev/20210118193122.87271-1-alobakin@pm.me
====================

Link: https://lore.kernel.org/r/20210122181909.36340-1-alobakin@pm.me
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp: allow forwarding of plain (non-fraglisted) UDP GRO packets

Commit 2c55e1abaf0b ("udp: Support UDP fraglist GRO/GSO.") actually
not only added a support for fraglisted UDP GRO, but also tweaked
some logics the way that non-fraglisted UDP GRO started to work for
forwarding too.
Commit 6a9513211583 ("net: add GSO UDP L4 and GSO fraglists to the
list of software-backed types") added GSO UDP L4 to the list of
software GSO to allow virtual netdevs to forward them as is up to
the real drivers.

Tests showed that currently forwarding and NATing of plain UDP GRO
packets are performed fully correctly, regardless if the target
netdevice has a support for hardware/driver GSO UDP L4 or not.
Add the last element and allow to form plain UDP GRO packets if
we are on forwarding path, and the new NETIF_F_GRO_UDP_FWD is
enabled on a receiving netdevice.

If both NETIF_F_GRO_FRAGLIST and NETIF_F_GRO_UDP_FWD are set,
fraglisted GRO takes precedence. This keeps the current behaviour
and is generally more optimal for now, as the number of NICs with
hardware USO offload is relatively small.

Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: introduce a netdev feature for UDP GRO forwarding

Introduce a new netdev feature, NETIF_F_GRO_UDP_FWD, to allow user
to turn UDP GRO on and off for forwarding.
Defaults to off to not change current datapath.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Alexander Lobakin <alobakin@pm.me>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'remove-unneeded-phy-time-stamping-option'

Richard Cochran says:

====================
Remove unneeded PHY time stamping option.

The NETWORK_PHY_TIMESTAMPING configuration option adds additional
checks into the networking hot path, and it is only needed by two
rather esoteric devices, namely the TI DP83640 PHYTER and the ZHAW
InES 1588 IP core. Very few end users have these devices, and those
that do have them are building specialized embedded systems.

Unfortunately two unrelated drivers depend on this option, and two
defconfigs enable it. It is probably my fault for not paying enough
attention in reviews.

This series corrects the gratuitous use of NETWORK_PHY_TIMESTAMPING.
====================

Link: https://lore.kernel.org/r/cover.1611198584.git.richardcochran@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mvpp2: Remove unneeded Kconfig dependency.

The mvpp2 is an Ethernet driver, and it implements MAC style time
stamping of PTP frames. It has no need of the expensive option to
enable PHY time stamping. Remove the incorrect dependency.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: mv88e6xxx: Remove bogus Kconfig dependency.

The mv88e6xxx is a DSA driver, and it implements DSA style time
stamping of PTP frames. It has no need of the expensive option to
enable PHY time stamping. Remove the bogus dependency.

Signed-off-by: Richard Cochran <richardcochran@gmail.com>
Acked-by: Brandon Streiff <brandon.streiff@ni.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-napi-poll-updates'

Alex Elder says:

====================
net: ipa: NAPI poll updates

While reviewing the IPA NAPI polling code in detail I found two
problems. This series fixes those, and implements a few other
improvements to this part of the code.

The first two patches are minor bug fixes that avoid extra passes
through the poll function. The third simplifies code inside the
polling loop a bit.

The last two update how interrupts are disabled; previously it was
possible for another I/O completion condition to be recorded before
NAPI got scheduled.
====================

Link: https://lore.kernel.org/r/20210121114821.26495-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: disable IEOB interrupts before clearing

Currently in gsi_isr_ieob(), event ring IEOB interrupts are disabled
one at a time.  The loop disables the IEOB interrupt for all event
rings represented in the event mask.  Instead, just disable them all
at once.

Disable them all *before* clearing the interrupt condition.  This
guarantees we'll schedule NAPI for each event once, before another
IEOB interrupt could be signaled.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: repurpose gsi_irq_ieob_disable()

Rename gsi_irq_ieob_disable() to be gsi_irq_ieob_disable_one().

Introduce a new function gsi_irq_ieob_disable() that takes a mask of
events to disable rather than a single event id. This will be used
in the next patch.

Rename gsi_irq_ieob_enable() to be gsi_irq_ieob_enable_one() to be
consistent.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: have gsi_channel_update() return a value

Have gsi_channel_update() return the first transaction in the
updated completed transaction list, or NULL if no new transactions
have been added.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: heed napi_complete() return value

Pay attention to the return value of napi_complete(), completing
polling only if it returns true.

Just use napi rather than &channel->napi as the argument passed to
napi_complete().

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: count actual work done in gsi_channel_poll()

There is an off-by-one problem in gsi_channel_poll().  The count of
transactions completed is incremented each time through the loop
*before* determining whether there is any more work to do.  As a
result, if we exit the loop early the counter its value is one more
than the number of transactions actually processed.

Instead, increment the count after processing, to ensure it reflects
the number of processed transactions.  The result is more naturally
described as a for loop rather than a while loop, so change that.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-expose-number-of-physical-ports'

Ido Schimmel says:

====================
mlxsw: Expose number of physical ports

The switch ASIC has a limited capacity of physical ports that it can
support. While each system is brought up with a different number of
ports, this number can be increased via splitting up to the ASIC's
limit.

Expose physical ports as a devlink resource so that user space will have
visibility into the maximum number of ports that can be supported and
the current occupancy. With this resource it is possible, for example,
to write generic (i.e., not platform dependent) tests for port
splitting.

Patch #1 adds the new resource and patch #2 adds a selftest.

v2:
* Add the physical ports resource as a generic devlink resource so that
it could be re-used by other device drivers
====================

Link: https://lore.kernel.org/r/20210121131024.2656154-1-idosch@idosch.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mlxsw: Add a scale test for physical ports

Query the maximum number of supported physical ports using devlink-resource
and test that this number can be reached by splitting each of the
splittable ports to its width. Test that an error is returned in case
the maximum number is exceeded.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: Register physical ports as a devlink resource

The switch ASIC has a limited capacity of physical ('flavour physical'
in devlink terminology) ports that it can support. While each system is
brought up with a different number of ports, this number can be
increased via splitting up to the ASIC's limit.

Expose physical ports as a devlink resource so that user space will have
visibility to the maximum number of ports that can be supported and the
current occupancy.

In addition, add a "Generic Resources" section in devlink-resource
documentation so the different drivers will be aligned by the same resource
name when exposing to user space.

Signed-off-by: Danielle Ratson <danieller@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'htb-offload'

Maxim Mikityanskiy says:

====================
HTB offload

This series adds support for HTB offload to the HTB qdisc, and adds
usage to mlx5 driver.

The previous RFCs are available at [1], [2].

The feature is intended to solve the performance bottleneck caused by
the single lock of the HTB qdisc, which prevents it from scaling well.
The HTB algorithm itself is offloaded to the device, eliminating the
need to take the root lock of HTB on every packet. Classification part
is done in clsact (still in software) to avoid acquiring the lock, which
imposes a limitation that filters can target only leaf classes.

The speedup on Mellanox ConnectX-6 Dx was 14.2 times in the UDP
multi-stream test, compared to software HTB implementation (more details
in the mlx5 patch).

[1]: https://www.spinics.net/lists/netdev/msg628422.html
[2]: https://www.spinics.net/lists/netdev/msg663548.html

v2 changes:

Fixed sparse and smatch warnings. Formatted HTB patches to 80 chars per
line.

v3 changes:

Fixed the CI failure on parisc with 16-bit xchg by replacing it with
WRITE_ONCE. Fixed the capability bits in mlx5_ifc.h and the value of
MLX5E_QOS_MAX_LEAF_NODES.

v4 changes:

Check if HTB is root when offloading. Add extack for hardware errors.
Rephrase explanations of how it works in the commit message. Remove %hu
from format strings. Add resiliency when leaf_del_last fails to create a
new leaf node.
====================

Link: https://lore.kernel.org/r/20210119120815.463334-1-maximmi@mellanox.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/mlx5e: Support HTB offload

This commit adds support for HTB offload in the mlx5e driver.

Performance:

  NIC: Mellanox ConnectX-6 Dx
  CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz (24 cores with HT)

  100 Gbit/s line rate, 500 UDP streams @ ~200 Mbit/s each
  48 traffic classes, flower used for steering
  No shaping (rate limits set to 4 Gbit/s per TC) - checking for max
  throughput.

  Baseline: 98.7 Gbps, 8.25 Mpps
  HTB: 6.7 Gbps, 0.56 Mpps
  HTB offload: 95.6 Gbps, 8.00 Mpps

Limitations:

1. 256 leaf nodes, 3 levels of depth.

2. Granularity for ceil is 1 Mbit/s. Rates are converted to weights, and
the bandwidth is split among the siblings according to these weights.
Other parameters for classes are not supported.

Ethtool statistics support for QoS SQs are also added. The counters are
called qos_txN_*, where N is the QoS queue number (starting from 0, the
numeration is separate from the normal SQs), and * is the counter name
(the counters are the same as for the normal SQs).

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sch_htb: Stats for offloaded HTB

This commit adds support for statistics of offloaded HTB. Bytes and
packets counters for leaf and inner nodes are supported, the values are
taken from per-queue qdiscs, and the numbers that the user sees should
have the same behavior as the software (non-offloaded) HTB.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sch_htb: Hierarchical QoS hardware offload

HTB doesn't scale well because of contention on a single lock, and it
also consumes CPU. This patch adds support for offloading HTB to
hardware that supports hierarchical rate limiting.

In the offload mode, HTB passes control commands to the driver using
ndo_setup_tc. The driver has to replicate the whole hierarchy of classes
and their settings (rate, ceil) in the NIC. Every modification of the
HTB tree caused by the admin results in ndo_setup_tc being called.

After this setup, the HTB algorithm is done completely in the NIC. An SQ
(send queue) is created for every leaf class and attached to the
hierarchy, so that the NIC can calculate and obey aggregated rate
limits, too. In the future, it can be changed, so that multiple SQs will
back a single leaf class.

ndo_select_queue is responsible for selecting the right queue that
serves the traffic class of each packet.

The data path works as follows: a packet is classified by clsact, the
driver selects a hardware queue according to its class, and the packet
is enqueued into this queue's qdisc.

This solution addresses two main problems of scaling HTB:

1. Contention by flow classification. Currently the filters are attached
to the HTB instance as follows:

    # tc filter add dev eth0 parent 1:0 protocol ip flower dst_port 80
    classid 1:10

It's possible to move classification to clsact egress hook, which is
thread-safe and lock-free:

    # tc filter add dev eth0 egress protocol ip flower dst_port 80
    action skbedit priority 1:10

This way classification still happens in software, but the lock
contention is eliminated, and it happens before selecting the TX queue,
allowing the driver to translate the class to the corresponding hardware
queue in ndo_select_queue.

Note that this is already compatible with non-offloaded HTB and doesn't
require changes to the kernel nor iproute2.

2. Contention by handling packets. HTB is not multi-queue, it attaches
to a whole net device, and handling of all packets takes the same lock.
When HTB is offloaded, it registers itself as a multi-queue qdisc,
similarly to mq: HTB is attached to the netdev, and each queue has its
own qdisc.

Some features of HTB may be not supported by some particular hardware,
for example, the maximum number of classes may be limited, the
granularity of rate and ceil parameters may be different, etc. - so, the
offload is not enabled by default, a new parameter is used to enable it:

    # tc qdisc replace dev eth0 root handle 1: htb offload

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: Add extack to Qdisc_class_ops.delete

In a following commit, sch_htb will start using extack in the delete
class operation to pass hardware errors in offload mode. This commit
prepares for that by adding the extack parameter to this callback and
converting usage of the existing qdiscs.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: sched: Add multi-queue support to sch_tree_lock

The existing qdiscs that set TCQ_F_MQROOT don't use sch_tree_lock.
However, hardware-offloaded HTB will start setting this flag while also
using sch_tree_lock.

The current implementation of sch_tree_lock basically locks on
qdisc->dev_queue->qdisc, and it works fine when the tree is attached to
some queue. However, it's not the case for MQROOT qdiscs: such a qdisc
is the root itself, and its dev_queue just points to queue 0, while not
actually being used, because there are real per-queue qdiscs.

This patch changes the logic of sch_tree_lock and sch_tree_unlock to
lock the qdisc itself if it's the MQROOT.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'tcp-add-cmsg-rx-timestamps-to-rx-zerocopy'

Arjun Roy says:

====================
tcp: add CMSG+rx timestamps to rx. zerocopy

Provide CMSG and receive timestamp support to TCP
receive zerocopy. Patch 1 refactors CMSG pending state for
tcp_recvmsg() to avoid the use of magic numbers; patch 2 implements
receive timestamp via CMSG support for receive zerocopy, and uses the
constants added in patch 1.
====================

Link: https://lore.kernel.org/r/20210121004148.2340206-1-arjunroy.kdev@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Add receive timestamp support for receive zerocopy.

tcp_recvmsg() uses the CMSG mechanism to receive control information
like packet receive timestamps. This patch adds CMSG fields to
struct tcp_zerocopy_receive, and provides receive timestamps
if available to the user.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: Remove CMSG magic numbers for tcp_recvmsg().

At present, tcp_recvmsg() uses flags to track if any CMSGs are pending
and what those CMSGs are. These flags are currently magic numbers,
used only within tcp_recvmsg().

To prepare for receive timestamp support in tcp receive zerocopy,
gently refactor these magic numbers into enums.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-bridge-multicast-add-initial-eht-support'

Nikolay Aleksandrov says:

====================
net: bridge: multicast: add initial EHT support

This set adds explicit host tracking support for IGMPv3/MLDv2. The
already present per-port fast leave flag is used to enable it since that
is the primary goal of EHT, to track a group and its S,Gs usage per-host
and when left without any interested hosts delete them before the standard
timers. The EHT code is pretty self-contained and not enabled by default.
There is no new uAPI added, all of the functionality is currently hidden
behind the fast leave flag. In the future that will change (more below).
The host tracking uses two new sets per port group: one having an entry for
each host which contains that host's view of the group (source list and
filter mode), and one set which contains an entry for each source having
an internal set which contains an entry for each host that has reported
an interest for that source. RB trees are used for all sets so they're
compact when not used and fast when we need to do lookups.
To illustrate it:
[ bridge port group ]
  ` [ host set (rb) ]
   ` [ host entry with a list of sources and filter mode ]
  ` [ source set (rb) ]
   ` [ source entry ]
    ` [ source host set (rb) ]
     ` [ source host entry with a timer ]

The number of tracked sources per host is limited to the maximum total
number of S,G entries per port group - PG_SRC_ENT_LIMIT (currently 32).
The number of hosts is unlimited, I think the argument that a local
attacker can exhaust the memory/cause high CPU usage can be applied to
fdb entries as well which are unlimited. In the future if needed we can
add an option to limit these, but I don't think it's necessary for a
start. All of the new sets are protected by the bridge's multicast lock.
I'm pretty sure we'll be changing the cases and improving the
convergence time in the future, but this seems like a good start.

Patch breakdown:
patch 1 -  4: minor cleanups and preparations for EHT
patch      5: adds the new structures which will be used in the
               following patches
patch      6: adds support to create, destroy and lookup host entries
patch      7: adds support to create, delete and lokup source set entries
patch      8: adds a host "delete" function which is just a host's
               source list flush since that would automatically delete
               the host
patch 9 - 10: add support for handling all IGMPv3/MLDv2 report types
               more information can be found in the individual patches
patch     11: optmizes a specific TO_INCLUDE use-case with host timeouts
patch     12: handles per-host filter mode changing (include <-> exclude)
patch     13: pulls out block group deletion since now it can be
               deleted in both filter modes
patch     14: marks deletions done due to fast leave

Future plans:
- export host information
- add an option to reduce queries
- add an option to limit the number of host entries
- tune more fast leave cases for quicker convergence

By the way I think this is the first open-source EHT implementation, I
couldn't find any while researching it. :)
====================

Link: https://lore.kernel.org/r/20210120145203.1109140-1-razor@blackwall.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: mark IGMPv3/MLDv2 fast-leave deletes

Mark groups which were deleted due to fast leave/EHT.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: handle block pg delete for all cases

A block report can result in empty source and host sets for both include
and exclude groups so if there are no hosts left we can safely remove
the group. Pull the block group handling so it can cover both cases and
add a check if EHT requires the delete.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT host filter_mode handling

We should be able to handle host filter mode changing. For exclude mode
we must create a zero-src entry so the group will be kept even without
any S,G entries (non-zero source sets). That entry doesn't count to the
entry limit and can always be created, its timer is refreshed on new
exclude reports and if we change the host filter mode to include then it
gets removed and we rely only on the non-zero source sets.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: optimize TO_INCLUDE EHT timeouts

This is an optimization specifically for TO_INCLUDE which sends queries
for the older entries and thus lowers the S,G timers to LMQT. If we have
the following situation for a group in either include or exclude mode:
- host A was interested in srcs X and Y, but is timing out
- host B sends TO_INCLUDE src Z, the bridge lowers X and Y's timeouts
   to LMQT
- host B sends BLOCK src Z after LMQT time has passed
=> since host B is the last host we can delete the group, but if we
    still have host A's EHT entries for X and Y (i.e. if they weren't
    lowered to LMQT previously) then we'll have to wait another LMQT
    time before deleting the group, with this optimization we can
    directly remove it regardless of the group mode as there are no more
    interested hosts

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT include and exclude handling

Add support for IGMPv3/MLDv2 include and exclude EHT handling. Similar to
how the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
- group include
  - is_include: create missing entries
  - to_include: flush existing entries and create a new set from the
    report, obviously if the src set is empty then we delete the group

- group exclude
  - is_exclude: create missing entries
  - to_exclude: flush existing entries and create a new set from the
    report, any empty source set entries are removed

If the group is in a different mode then we just flush all entries reported
by the host and we create a new set with the new mode entries created from
the report. If the report is include type, the source list is empty and
the group has empty sources' set then we remove it. Any source set entries
which are empty are removed as well. If the group is in exclude mode it
can exist without any S,G entries (allowing for all traffic to pass).

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT allow/block handling

Add support for IGMPv3/MLDv2 allow/block EHT handling. Similar to how
the reports are processed we have 2 cases when the group is in include
or exclude mode, these are processed as follows:
- group include
  - allow: create missing entries
  - block: remove existing matching entries and remove the corresponding
    S,G entries if there are no more set host entries, then possibly
    delete the whole group if there are no more S,G entries

- group exclude
  - allow
    - host include: create missing entries
    - host exclude: remove existing matching entries and remove the
      corresponding S,G entries if there are no more set host entries
  - block
    - host include: remove existing matching entries and remove the
      corresponding S,G entries if there are no more set host entries,
      then possibly delete the whole group if there are no more S,G entries
    - host exclude: create missing entries

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT host delete function

Now that we can delete set entries, we can use that to remove EHT hosts.
Since the group's host set entries exist only when there are related
source set entries we just have to flush all source set entries
joined by the host set entry and it will be automatically removed.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT source set handling functions

Add EHT source set and set-entry create, delete and lookup functions.
These allow to manipulate source sets which contain their own host sets
with entries which joined that S,G. We're limiting the maximum number of
tracked S,G entries per host to PG_SRC_ENT_LIMIT (currently 32) which is
the current maximum of S,G entries for a group. There's a per-set timer
which will be used to destroy the whole set later.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT host handling functions

Add functions to create, destroy and lookup an EHT host. These are
per-host entries contained in the eht_host_tree in net_bridge_port_group
which are used to store a list of all sources (S,G) entries joined for that
group by each host, the host's current filter mode and total number of
joined entries.
No functional changes yet, these would be used in later patches.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: add EHT structures and definitions

Add EHT structures for tracking hosts and sources per group. We keep one
set for each host which has all of the host's S,G entries, and one set for
each multicast source which has all hosts that have joined that S,G. For
each host, source entry we record the filter_mode and we keep an expiry
timer. There is also one global expiry timer per source set, it is
updated with each set entry update, it will be later used to lower the
set's timer instead of lowering each entry's timer separately.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: calculate idx position without changing ptr

We need to preserve the srcs pointer since we'll be passing it for EHT
handling later.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: __grp_src_block_incl can modify pg

Prepare __grp_src_block_incl() for being able to cause a notification
due to changes. Currently it cannot happen, but EHT would change that
since we'll be deleting sources immediately. Make sure that if the pg is
deleted we don't return true as that would cause the caller to access
freed pg. This patch shouldn't cause any functional change.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: pass host src address to IGMPv3/MLDv2 functions

We need to pass the host address so later it can be used for explicit
host tracking. No functional change.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: bridge: multicast: rename src_size to addr_size

Rename src_size argument to addr_size in preparation for passing host
address as an argument to IGMPv3/MLDv2 functions.
No functional change.

Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: hns3: replace skb->csum_not_inet with skb_csum_is_sctp

Commit 3d985ffbf1cd ("net: add inline function skb_csum_is_sctp")
missed replacing skb->csum_not_inet check in hns3. This patch is
to replace it with skb_csum_is_sctp().

Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/3ad3c22c08beb0947f5978e790bd98d2aa063df9.1611307861.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-re-enable-sndbuf-autotune'

Paolo Abeni says:

====================
mptcp: re-enable sndbuf autotune

The sendbuffer autotuning was unintentionally disabled as a
side effect of the recent workqueue removal refactor. These
patches re-enable id, with some extra care: with autotuning
enable/large send buffer we need a more accurate packet
scheduler to be able to use efficiently the available
subflow bandwidth, especially when the subflows have
different capacities.

The first patch cleans-up subflow socket handling, making
the actual re-enable (patch 2) simpler.

Patches 3 and 4 improve the packet scheduler, to better cope
with non trivial scenarios and large send buffer.

Finally patch 5 adds and uses some infrastructure to avoid
the workqueue usage for the packet scheduler operations introduced
by the previous patches.
====================

Link: https://lore.kernel.org/r/cover.1611153172.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: implement delegated actions

On MPTCP-level ack reception, the packet scheduler
may select a subflow other then the current one.

Prior to this commit we rely on the workqueue to trigger
action on such subflow.

This changeset introduces an infrastructure that allows
any MPTCP subflow to schedule actions (MPTCP xmit) on
others subflows without resorting to (multiple) process
reschedule.

A dummy NAPI instance is used instead. When MPTCP needs to
trigger action an a different subflow, it enqueues the target
subflow on the NAPI backlog and schedule such instance as needed.

The dummy NAPI poll method walks the sockets backlog and tries
to acquire the (BH) socket lock on each of them. If the socket
is owned by the user space, the action will be completed by
the sock release cb, otherwise push is started.

This change leverages the delegated action infrastructure
to avoid invoking the MPTCP worker to spool the pending data,
when the packet scheduler picks a subflow other then the one
currently processing the incoming MPTCP-level ack.

Additionally we further refine the subflow selection
invoking the packet scheduler for each chunk of data
even inside __mptcp_subflow_push_pending().

v1 -> v2:
- fix possible UaF at shutdown time, resetting sock ops
after removing the ulp context

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: schedule work for better snd subflow selection

Otherwise the packet scheduler policy will not be
enforced when pushing pending data at MPTCP-level
ack reception time.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: do not queue excessive data on subflows

The current packet scheduler can enqueue up to sndbuf
data on each subflow. If the send buffer is large and
the subflows are not symmetric, this could lead to
suboptimal aggregate bandwidth utilization.

Limit the amount of queued data to the maximum send
window.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: re-enable sndbuf autotune

After commit 3bae9847f333 ("mptcp: use mptcp release_cb for
delayed tasks"), MPTCP never sets the flag bit SOCK_NOSPACE
on its subflow. As a side effect, autotune never takes place,
as it happens inside tcp_new_space(), which in turn is called
only when the mentioned bit is set.

Let's sendmsg() set the subflows NOSPACE bit when looking for
more memory and use the subflow write_space callback to propagate
the snd buf update and wake-up the user-space.

Additionally, this allows dropping a bunch of duplicate code and
makes the SNDBUF_LIMITED chrono relevant again for MPTCP subflows.

Fixes: 3bae9847f333 ("mptcp: use mptcp release_cb for delayed tasks")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: always graft subflow socket to parent

Currently, incoming subflows link to the parent socket,
while outgoing ones link to a per subflow socket. The latter
is not really needed, except at the initial connect() time and
for the first subflow.

Always graft the outgoing subflow to the parent socket and
free the unneeded ones early.

This allows some code cleanup, reduces the amount of memory
used and will simplify the next patch

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

sfc: reduce the number of requested xdp ev queues

Without this change the driver tries to allocate too many queues,
breaching the number of available msi-x interrupts on machines
with many logical cpus and default adapter settings:

Insufficient resources for 12 XDP event queues (24 other channels, max 32)

Which in turn triggers EINVAL on XDP processing:

sfc 0000:86:00.0 ext0: XDP TX failed (-22)

Signed-off-by: Ivan Babrou <ivan@cloudflare.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com>
Link: https://lore.kernel.org/r/20210120212759.81548-1-ivan@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: add TTL to SCM_TIMESTAMPING_OPT_STATS

This patch adds TCP_NLA_TTL to SCM_TIMESTAMPING_OPT_STATS that exports
the time-to-live or hop limit of the latest incoming packet with
SCM_TSTAMP_ACK. The value exported may not be from the packet that acks
the sequence when incoming packets are aggregated. Exporting the
time-to-live or hop limit value of incoming packets helps to estimate
the hop count of the path of the flow that may change over time.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Link: https://lore.kernel.org/r/20210120204155.552275-1-ysseung@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

tcp: remove unused ICSK_TIME_EARLY_RETRANS

Since the early retransmit has been removed by
commit eabe16c436ab ("tcp: remove early retransmit"),
we also remove the unused ICSK_TIME_EARLY_RETRANS macro.

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/1611239473-27304-1-git-send-email-yangpc@wangsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: phy: realtek: Add support for RTL9000AA/AN

RTL9000AA/AN as 100BASE-T1 is following:
- 100 Mbps
- Full duplex
- Link Status Change Interrupt
- Master/Slave configuration

Signed-off-by: Yuusuke Ashizuka <ashiduka@fujitsu.com>
Signed-off-by: Torii Kenichi <torii.ken1@fujitsu.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210121080254.21286-1-ashiduka@fujitsu.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: renesas,etheravb: Add r8a779a0 support

Document the compatible value for the RAVB block in the Renesas R-Car
V3U (R8A779A0) SoC. This variant has no stream buffer, so we only need
to add the new compatible and add it to the TX delay block.

Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Rob Herring <robh@kernel.org>
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
Link: https://lore.kernel.org/r/20210121100619.5653-2-wsa+renesas@sang-engineering.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-ipa-remove-a-build-dependency'

Alex Elder says:

====================
net: ipa: remove a build dependency

Unlike the original (temporary) IPA notification mechanism, the
generic remoteproc SSR notification code does not require the IPA
driver to maintain a pointer to the modem subsystem remoteproc
structure.

The IPA driver was converted to use the newer SSR notifiers, but the
specification and use of a phandle for the modem subsystem was never
removed.

This series removes the lookup of the remoteproc pointer, and that
removes the need for the modem DT property. It also removes the
reference to the "modem-remoteproc" property from the DT binding,
and from the DT files that specified them.
====================

Link: https://lore.kernel.org/r/20210120212606.12556-1-elder@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

arm64: dts: qcom: sdm845: kill IPA modem-remoteproc property

The "modem-remoteproc" property is no longer required for the IPA
driver, so get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

arm64: dts: qcom: sc7180: kill IPA modem-remoteproc property

The "modem-remoteproc" property is no longer required for the IPA
driver, so get rid of it.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

dt-bindings: net: remove modem-remoteproc property

The IPA driver uses the remoteproc SSR notifier now, rather than the
temporary IPA notification system used initially. As a result it no
longer needs a property identifying the modem subsystem DT node.

Use GIC_SPI rather than 0 in the example interrupt definition.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ipa: remove a remoteproc dependency

The IPA driver currently requires a DT property to be defined whose
value is the phandle for the modem subsystem. This was needed to
look up a remoteproc structure pointer used when registering for
notifications in the original IPA notification mechanism.

Remoteproc provides a more generic SSR notifier system, and the IPA
driver switched over to it last summer, but this remoteproc phandle
dependency was not removed at that time.

Get rid of the IPA remoteproc pointer and stop requiring the phandle
be specified.

This avoids a link error (rproc_put() not defined) for certain
configurations.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: macb: ignore tx_clk if MII is used

If the MII interface is used, the PHY is the clock master, thus don't
set the clock rate. On Zynq-7000, this will prevent the following
warning:
macb e000b000.ethernet eth0: unable to generate target frequency: 25000000 Hz

Signed-off-by: Michael Walle <michael@walle.cc>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Link: https://lore.kernel.org/r/20210120194303.28268-1-michael@walle.cc
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: remove aurora nb8800 driver

The tango4 platform is getting removed, and this driver has no
other known users, so it can be removed.

Cc: Marc Gonzalez <marc.w.gonzalez@free.fr>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Mans Rullgard <mans@mansr.com>
Link: https://lore.kernel.org/r/20210120150703.1629527-1-arnd@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cxgb4: remove bogus CHELSIO_VPD_UNIQUE_ID constant

The comment is quite weird, there is no such thing as a vendor-specific
VPD id. 0x82 is the value of PCI_VPD_LRDT_ID_STRING. So what we are
doing here is simply checking whether the byte at VPD address VPD_BASE
is a valid string LRDT, same as what is done a few lines later in
the code.
LRDT = Large Resource Data Tag, see PCI 2.2 spec, VPD chapter

v2:
- don't set VPD_BASE / VPD_BASE_OLD separately

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://lore.kernel.org/r/644ef22f-e86a-5cc1-0f27-f873ab165696@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

cxgb4: Assign boolean values to a bool variable

Fix the following coccicheck warnings:

./drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:5142:2-33:
WARNING: Assignment of 0/1 to bool variable.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Zhong <abaci-bugfix@linux.alibaba.com>
Link: https://lore.kernel.org/r/1611126111-22079-1-git-send-email-abaci-bugfix@linux.alibaba.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

MAINTAINERS: add entry for Arrow SpeedChips XRS7000 driver

Add myself as maintainer of the Arrow SpeedChips XRS7000 series Ethernet
switch driver.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: George McCollister <george.mccollister@gmail.com>
Link: https://lore.kernel.org/r/20210120135323.73856-1-george.mccollister@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'ucc_geth-improvements'

Rasmus Villemoes says:

====================
ucc_geth improvements

This is a resend of some improvements to the ucc_geth driver that was
previously sent together with bug fixes, which have by now been
applied.

v2: rebase to net/master; address minor style issues; don't introduce
a use-after-free in patch "don't statically allocate eight
ucc_geth_info".
====================

Link: https://lore.kernel.org/r/20210119150802.19997-1-rasmus.villemoes@prevas.dk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: simplify rx/tx allocations

Since kmalloc() is nowadays [1] guaranteed to return naturally
aligned (i.e., aligned to the size itself) memory for power-of-2
sizes, we don't need to over-allocate the align amount, compute an
aligned address within the allocation, and (for later freeing) also
storing the original pointer [2].

Instead, just round up the length we want to allocate to the alignment
requirements, then round that up to the next power of 2. In theory,
this could allocate up to about twice as much memory as we needed. In
practice, (a) kmalloc() would in most cases anyway return a
power-of-2-sized allocation and (b) with the default values of the
bdRingLen[RT]x fields, the length is already itself a power of 2
greater than the alignment.

So we actually end up saving memory compared to the current
situtation (e.g. for tx, we currently allocate 128+32 bytes, which
kmalloc() likely rounds up to 192 or 256; with this patch, we just
allocate 128 bytes.) Also struct ucc_geth_private becomes a little
smaller.

[1] c416a55464dc ("mm, sl[aou]b: guarantee natural alignment for
kmalloc(power-of-two)")

[2] That storing was anyway done in a u32, which works on 32 bit
machines, but is not very elegant and certainly makes a reader of the
code pause for a while.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: inform the compiler that numQueues is always 1

The numQueuesTx and numQueuesRx members of struct ucc_geth_info are
never set to anything but 1, and never have been. It's unclear how
well the code supporting multiple queues would work. Until somebody
wants to play with enabling that, help the compiler eliminate a lot of
dead code and loops that are not really loops by creating static
inline helpers. If and when the numQueuesTx/numQueuesRx fields are
re-introduced, it suffices to update those helper to return the
appropriate field.

This cuts the .text segment of ucc_geth.o by 8%.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: add helper to replace repeated switch statements

The translation from the ucc_geth_num_of_threads enum value to the
actual count can be written somewhat more compactly with a small
lookup table, allowing us to replace the four switch statements.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: replace kmalloc_array()+for loop by kcalloc()

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: remove bd_mem_part and all associated code

The bd_mem_part member of ucc_geth_info always has the value
MEM_PART_SYSTEM, and AFAICT, there has never been any code setting it
to any other value. Moreover, muram is a somewhat precious resource,
so there's no point using that when normal memory serves just as well.

Apart from removing a lot of dead code, this is also motivated by
wanting to clean up the "store result from kmalloc() in a u32" mess.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: use UCC_GETH_{RX,TX}_BD_RING_ALIGNMENT macros directly

These macros both have the value 32, there's no point first
initializing align to a lower value.

If anything, one could throw in a
BUILD_BUG_ON(UCC_GETH_TX_BD_RING_ALIGNMENT < 4), but it's not worth it
- lots of code depends on named constants having sensible values.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: don't statically allocate eight ucc_geth_info

struct ucc_geth_info is somewhat large, and on systems with only one
or two UCC instances, that just wastes a few KB of memory. So
allocate and populate a chunk of memory at probe time instead of
initializing them all during driver init.

Note that the existing "ug_info == NULL" check was dead code, as the
address of some static array element can obviously never be NULL.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: constify ugeth_primary_info

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: factor out parsing of {rx,tx}-clock{,-name} properties

Reduce the code duplication a bit by moving the parsing of
rx-clock-name and the fallback handling to a helper function.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: remove {rx,tx}_glbl_pram_offset from struct ucc_geth_private

These fields are only used within ucc_geth_startup(), so they might as
well be local variables in that function rather than being stashed in
struct ucc_geth_private.

Aside from making that struct a tiny bit smaller, it also shortens
some lines (getting rid of pointless casts while here), and fixes the
problems with using IS_ERR_VALUE() on a u32 as explained in commit
ab81a24f4d28 ("soc: fsl: qe: change return type of cpm_muram_alloc()
to s32").

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: replace kmalloc+memset by kzalloc

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: remove unnecessary memset_io() calls

These buffers have all just been handed out from qe_muram_alloc(), aka
cpm_muram_alloc(), and the helper cpm_muram_alloc_common() already
does

memset_io(cpm_muram_addr(start), 0, size);

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: use qe_muram_free_addr()

This removes the explicit NULL checks, and allows us to stop storing
at least some of the _offset values separately.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

soc: fsl: qe: add cpm_muram_free_addr() helper

Add a helper that takes a virtual address rather than the muram
offset. This will be used in a couple of places to avoid having to
store both the offset and the virtual address, as well as removing
NULL checks from the callers.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Acked-by: Li Yang <leoyang.li@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

soc: fsl: qe: store muram_vbase as a void pointer instead of u8

The two functions cpm_muram_offset() and cpm_muram_dma() both need a
cast currently, one casts muram_vbase to do the pointer arithmetic on
void pointers, the other casts the passed-in address u8*.

It's simpler and more consistent to just always use void* and drop all
the casting.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Acked-by: Li Yang <leoyang.li@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

soc: fsl: qe: make cpm_muram_offset take a const void* argument

Allow passing const-qualified pointers without requiring a cast in the
caller.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Acked-by: Li Yang <leoyang.li@nxp.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ethernet: ucc_geth: remove unused read of temoder field

In theory, such a read-after-write might be required by the hardware,
but nothing in the data sheet suggests that to be the case. The name
test also suggests that it's some debug leftover.

Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'add-devlink-health-reporters-for-nix-block'

George Cherian says:

====================
Add devlink health reporters for NIX block

Devlink health reporters are added for the NIX block.

Address Jakub's comment to add devlink support for error reporting.
https://www.spinics.net/lists/netdev/msg670712.html

This series is in continuation to
https://www.spinics.net/lists/netdev/msg707798.html

Added Documentation for the same.
====================

Link: https://lore.kernel.org/r/20210119100120.2614730-1-george.cherian@marvell.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

docs: octeontx2: Add Documentation for NIX health reporters

Add devlink health reporter documentation for NIX block.

Signed-off-by: George Cherian <george.cherian@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

octeontx2-af: Add devlink health reporters for NIX

Add health reporters for RVU NIX block.
NIX Health reporters handle following HW event groups
- GENERAL events
- ERROR events
- RAS events
- RVU event

Output:

# devlink health
pci/0002:01:00.0:
   reporter hw_npa_intr
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_npa_gen
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_npa_err
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_npa_ras
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_nix_intr
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_nix_gen
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_nix_err
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true
   reporter hw_nix_ras
     state healthy error 0 recover 0 grace_period 0 auto_recover true auto_dump true

# devlink health dump show pci/0002:01:00.0 reporter hw_nix_intr
  NIX_AF_RVU:
NIX RVU Interrupt Reg : 1
Unmap Slot Error
# devlink health dump show pci/0002:01:00.0 reporter hw_nix_gen
  NIX_AF_GENERAL:
NIX General Interrupt Reg : 1
Rx multicast pkt drop

Each reporter dump shows the Register value and the description of the cause.

Signed-off-by: Sunil Kovvuri Goutham <sgoutham@marvell.com>
Signed-off-by: Jerin Jacob <jerinj@marvell.com>
Signed-off-by: George Cherian <george.cherian@marvell.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: stmmac: dwmac-meson8b: fix the RX delay validation

When has_prg_eth1_rgmii_rx_delay is true then we support RX delays
between 0ps and 3000ps in 200ps steps. Swap the validation of the RX
delay based on the has_prg_eth1_rgmii_rx_delay flag so the 200ps check
is now applied correctly on G12A SoCs (instead of only allow 0ps or
2000ps on G12A, but 0..3000ps in 200ps steps on older SoCs which don't
support that).

Fixes: a72f17e94d1242 ("net: stmmac: dwmac-meson8b: add support for the RGMII RX delay on G12A")
Reported-by: Martijn van Deventer <martijn@martijnvandeventer.nl>
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Link: https://lore.kernel.org/r/20210119202424.591349-1-martin.blumenstingl@googlemail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ip_gre: remove CRC flag from dev features in gre_gso_segment

This patch is to let it always do CRC checksum in sctp_gso_segment()
by removing CRC flag from the dev features in gre_gso_segment() for
SCTP over GRE, just as it does in Commit 4fc864594a4e ("udp: support
sctp over udp in skb_udp_tunnel_segment") for SCTP over UDP.

It could set csum/csum_start in GSO CB properly in sctp_gso_segment()
after that commit, so it would do checksum with gso_make_checksum()
in gre_gso_segment(), and Commit f60f35d5fc11 ("net: gre: recompute
gre csum for sctp over gre tunnels") can be reverted now.

Note that when need_csum is false, we can still leave CRC checksum
of SCTP to HW by not clearing this CRC flag if it's supported, as
Jakub and Alex noticed.

v1->v2:
  - improve the changelog.
  - fix "rev xmas tree" in varibles declaration.
v2->v3:
  - remove CRC flag from dev features only when need_csum is true.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/00439f24d5f69e2c6fa2beadc681d056c15c258f.1610772251.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

udp: not remove the CRC flag from dev features when need_csum is false

In __skb_udp_tunnel_segment(), when it's a SCTP over VxLAN/GENEVE
packet and need_csum is false, which means the outer udp checksum
doesn't need to be computed, csum_start and csum_offset could be
used by the inner SCTP CRC CSUM for SCTP HW CRC offload.

So this patch is to not remove the CRC flag from dev features when
need_csum is false.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Link: https://lore.kernel.org/r/1e81b700642498546eaa3f298e023fd7ad394f85.1610776757.git.lucien.xin@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/sched: cls_flower add CT_FLAGS_INVALID flag support

This patch add the TCA_FLOWER_KEY_CT_FLAGS_INVALID flag to
match the ct_state with invalid for conntrack.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Link: https://lore.kernel.org/r/1611045110-682-1-git-send-email-wenxu@ucloud.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'net-inline-rollback_registered-functions'

After recent changes to the error path of register_netdevice()
we no longer need a version of unregister_netdevice_many() which
does not set net_todo. We can inline the rollback_registered()
functions into respective unregister_netdevice() calls.

Link: https://lore.kernel.org/r/20210119202521.3108236-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: inline rollback_registered_many()

Similar to the change for rollback_registered() -
rollback_registered_many() was a part of unregister_netdevice_many()
minus the net_set_todo(), which is no longer needed.

Functionally this patch moves the list_empty() check back after:

BUG_ON(dev_boot_phase);
ASSERT_RTNL();

but I can't find any reason why that would be an issue.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move rollback_registered_many()

Move rollback_registered_many() and add a temporary
forward declaration to make merging the code into
unregister_netdevice_many() easier to review.

No functional changes.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: inline rollback_registered()

rollback_registered() is a local helper, it's common for driver
code to call unregister_netdevice_queue(dev, NULL) when they
want to unregister netdevices under rtnl_lock. Inline
rollback_registered() and adjust the only remaining caller.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: move net_set_todo inside rollback_registered()

Commit 19dbffef8f48 ("[NET]: Fix free_netdev on register_netdev
failure.") moved net_set_todo() outside of rollback_registered()
so that rollback_registered() can be used in the failure path of
register_netdevice() but without risking a double free.

Since commit 7e09adf7d54b ("net: Fix inconsistent teardown and
release of private netdev state."), however, we have a better
way of handling that condition, since destructors don't call
free_netdev() directly.

After the change in commit 5a137c4f496a ("net: make free_netdev()
more lenient with unregistering devices") we can now move
net_set_todo() back.

Reviewed-by: Edwin Peer <edwin.peer@broadcom.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'nexthop-more-fine-grained-policies-for-netlink-message-validation'

Petr Machata says:

====================
nexthop: More fine-grained policies for netlink message validation

There is currently one policy that covers all attributes for next hop
object management. Actual validation is then done in code, which makes it
unobvious which attributes are acceptable when, and indeed that everything
is rejected as necessary.

In this series, split rtm_nh_policy to several policies that cover various
aspects of the next hop object configuration, and instead of open-coding
the validation, defer to nlmsg_parse(). This should make extending the next
hop code simpler as well, which will be relevant in near future for
resilient hashing implementation.

This was tested by running tools/testing/selftests/net/fib_nexthops.sh.
Additionally iproute2 was tweaked to issue "nexthop list id" as an
RTM_GETNEXTHOP dump request, instead of a straight get to test that
unexpected attributes are indeed rejected.
====================

Link: https://lore.kernel.org/r/cover.1611156111.git.petrm@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

nexthop: Specialize rtm_nh_policy

This policy is currently only used for creation of new next hops and new
next hop groups. Rename it accordingly and remove the two attributes that
are not valid in that context: NHA_GROUPS and NHA_MASTER.

For consistency with other policies, do not mention policy array size in
the declarator, and replace NHA_MAX for ARRAY_SIZE as appropriate.

Note that with this commit, NHA_MAX and __NHA_MAX are not used anymore.
Leave them in purely as a user API.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>