Íñigo Huguet [Thu, 3 Jun 2021 06:34:29 +0000 (08:34 +0200)]
net:cxgb3: replace tasklets with works
OFLD and CTRL TX queues can be stopped if there is no room in
their DMA rings. If this happens, they're tried to be restarted
later after having made some room in the corresponding ring.
The tasks of restarting these queues were triggered using
tasklets, but they can be replaced for workqueue works, getting
them out of softirq context.
This queues stop/restart probably doesn't happen often and they
can be quite lengthy because they try to send all pending skbs.
Moreover, given that probably the ring is not empty yet, so the
DMA still has work to do, we don't need to be so fast to justify
using tasklets/softirq instead of running in a thread.
Signed-off-by: Íñigo Huguet <ihuguet@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Thu, 3 Jun 2021 00:51:21 +0000 (17:51 -0700)]
net: tcp better handling of reordering then loss cases
This patch aims to improve the situation when reordering and loss are
ocurring in the same flight of packets.
Previously the reordering would first induce a spurious recovery, then
the subsequent ACK may undo the cwnd (based on the timestamps e.g.).
However the current loss recovery does not proceed to invoke
RACK to install a reordering timer. If some packets are also lost, this
may lead to a long RTO-based recovery. An example is
https://groups.google.com/g/bbr-dev/c/OFHADvJbTEI
The solution is to after reverting the recovery, always invoke RACK
to either mount the RACK timer to fast retransmit after the reordering
window, or restarts the recovery if new loss is identified. Hence
it is possible the sender may go from Recovery to Disorder/Open to
Recovery again in one ACK.
Reported-by: mingkun bian <bianmingkun@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Additionally replace other strncpy() uses, as it is considered deprecated:
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/202102150705.fdR6obB0-lkp@intel.com Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Kees Cook [Wed, 2 Jun 2021 20:27:41 +0000 (13:27 -0700)]
net: vlan: Avoid using strncpy()
Use strscpy_pad() instead of strncpy() which is considered deprecated:
https://www.kernel.org/doc/html/latest/process/deprecated.html#strncpy-on-nul-terminated-strings
Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Jun 2021 21:11:22 +0000 (14:11 -0700)]
Merge branch 'NVMeTCP-Offload-ULP'
Shai Malin says:
====================
NVMeTCP Offload ULP
With the goal of enabling a generic infrastructure that allows NVMe/TCP
offload devices like NICs to seamlessly plug into the NVMe-oF stack, this
patch series introduces the nvme-tcp-offload ULP host layer, which will
be a new transport type called "tcp-offload" and will serve as an
abstraction layer to work with vendor specific nvme-tcp offload drivers.
NVMeTCP offload is a full offload of the NVMeTCP protocol, this includes
both the TCP level and the NVMeTCP level.
The nvme-tcp-offload transport can co-exist with the existing tcp and
other transports. The tcp offload was designed so that stack changes are
kept to a bare minimum: only registering new transports.
All other APIs, ops etc. are identical to the regular tcp transport.
Representing the TCP offload as a new transport allows clear and manageable
differentiation between the connections which should use the offload path
and those that are not offloaded (even on the same device).
The nvme-tcp-offload layers and API compared to nvme-tcp and nvme-rdma:
* NVMe layer: *
[ nvme/nvme-fabrics/blk-mq ]
|
(nvme API and blk-mq API)
|
|
* Vendor agnostic transport layer: *
Performance:
============
With this implementation on top of the Marvell qedn driver (using the
Marvell FastLinQ NIC), we were able to demonstrate the following CPU
utilization improvement:
On AMD EPYC 7402, 2.80GHz, 28 cores:
- For 16K queued read IOs, 16jobs, 4qd (50Gbps line rate):
Improved the CPU utilization from 15.1% with NVMeTCP SW to 4.7% with
NVMeTCP offload.
On Intel(R) Xeon(R) Gold 5122 CPU, 3.60GHz, 16 cores:
- For 512K queued read IOs, 16jobs, 4qd (25Gbps line rate):
Improved the CPU utilization from 16.3% with NVMeTCP SW to 1.1% with
NVMeTCP offload.
In addition, we were able to demonstrate the following latency improvement:
- For 200K read IOPS (16 jobs, 16 qd, with fio rate limiter):
Improved the average latency from 105 usec with NVMeTCP SW to 39 usec
with NVMeTCP offload.
Improved the 99.99 tail latency from 570 usec with NVMeTCP SW to 91 usec
with NVMeTCP offload.
The end-to-end offload latency was measured from fio while running against
back end of null device.
Upstream plan:
==============
The RFC series "NVMeTCP Offload ULP and QEDN Device Driver"
https://lore.kernel.org/netdev/20210531225222.16992-1-smalin@marvell.com/
was designed in a modular way so that part 1 (nvme-tcp-offload) and
part 2 (qed) are independent and part 3 (qedn) depends on both parts 1+2.
- Part 1 (RFC patch 1-8): NVMeTCP Offload ULP
The nvme-tcp-offload patches, will be sent to
'linux-nvme@lists.infradead.org'.
- Part 2 (RFC patches 9-15): QED NVMeTCP Offload
The qed infrastructure, will be sent to 'netdev@vger.kernel.org'.
Once part 1 and 2 are accepted:
- Part 3 (RFC patches 16-27): QEDN NVMeTCP Offload
The qedn patches, will be sent to 'linux-nvme@lists.infradead.org'.
Marvell is fully committed to maintain, test, and address issues with
the new nvme-tcp-offload layer.
Usage:
======
With the Marvell NVMeTCP offload design, the network-device (qede) and the
offload-device (qedn) are paired on each port - Logically similar to the
RDMA model.
The user will interact with the network-device in order to configure
the ip/vlan. The NVMeTCP configuration is populated as part of the
nvme connect command.
Example:
Assign IP to the net-device (from any existing Linux tool):
ip addr add 100.100.0.101/24 dev p1p1
This IP will be used by both net-device (qede) and offload-device (qedn).
In order to connect from "sw" nvme-tcp through the net-device (qede):
nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn
In order to connect from "offload" nvme-tcp through the offload-device (qedn):
nvme connect -t tcp_offload -s 4420 -a 100.100.0.100 -n testnqn
An alternative approach, and as a future enhancement that will not impact this
series will be to modify nvme-cli with a new flag that will determine
if "-t tcp" should be the regular nvme-tcp (which will be the default)
or nvme-tcp-offload.
Exmaple:
nvme connect -t tcp -s 4420 -a 100.100.0.100 -n testnqn -[new flag]
Queue Initialization Design:
============================
The nvme-tcp-offload ULP module shall register with the existing
nvmf_transport_ops (.name = "tcp_offload"), nvme_ctrl_ops and blk_mq_ops.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following ops:
- claim_dev() - in order to resolve the route to the target according to
the paired net_dev.
- create_queue() - in order to create offloaded nvme-tcp queue.
The nvme-tcp-offload ULP module shall manage all the controller level
functionalities, call claim_dev and based on the return values shall call
the relevant module create_queue in order to create the admin queue and
the IO queues.
IO-path Design:
===============
The nvme-tcp-offload shall work at the IO-level - the nvme-tcp-offload
ULP module shall pass the request (the IO) to the nvme-tcp-offload vendor
driver and later, the nvme-tcp-offload vendor driver returns the request
completion (the IO completion).
No additional handling is needed in between; this design will reduce the
CPU utilization as we will describe below.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
- send_req() - in order to pass the request to the handling of the
offload driver that shall pass it to the vendor specific device.
- poll_queue()
Once the IO completes, the nvme-tcp-offload vendor driver shall call
command.done() that will invoke the nvme-tcp-offload ULP layer to
complete the request.
TCP events:
===========
The Marvell FastLinQ NIC HW engine handle all the TCP re-transmissions
and OOO events.
Teardown and errors:
====================
In case of NVMeTCP queue error the nvme-tcp-offload vendor driver shall
call the nvme_tcp_ofld_report_queue_err.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following teardown ops:
- drain_queue()
- destroy_queue()
The Marvell FastLinQ NIC HW engine:
====================================
The Marvell NIC HW engine is capable of offloading the entire TCP/IP
stack and managing up to 64K connections per PF, already implemented and
upstream use cases for this include iWARP (by the Marvell qedr driver)
and iSCSI (by the Marvell qedi driver).
In addition, the Marvell NIC HW engine offloads the NVMeTCP queue layer
and is able to manage the IO level also in case of TCP re-transmissions
and OOO events.
The HW engine enables direct data placement (including the data digest CRC
calculation and validation) and direct data transmission (including data
digest CRC calculation).
The Marvell qedn driver:
========================
The new driver will be added under "drivers/nvme/hw" and will be enabled
by the Kconfig "Marvell NVM Express over Fabrics TCP offload".
As part of the qedn init, the driver will register as a pci device driver
and will work with the Marvell fastlinQ NIC.
As part of the probe, the driver will register to the nvme_tcp_offload
(ULP) and to the qed module (qed_nvmetcp_ops) - similar to other
"qed_*_ops" which are used by the qede, qedr, qedf and qedi device
drivers.
Changes since RFC v2:
=====================
- nvme-tcp-offload: Fixes in controller and queue level (patches 3-6).
- qedn: Add the Marvell's NVMeTCP HW offload vendor driver init and probe
(patches 8-11).
Changes since RFC v3:
=====================
- nvme-tcp-offload: Add the full implementation of the nvme-tcp-offload layer
including the new ops: setup_ctrl(), release_ctrl(), commit_rqs() and new
flows (ASYNC and timeout).
- nvme-tcp-offload: Add device maximums: max_hw_sectors, max_segments.
- nvme-tcp-offload: layer design and optimization changes.
Changes since RFC v4:
=====================
(Many thanks to Hannes Reinecke for his feedback)
- nvme_tcp_offload: Add num_hw_vectors in order to limit the number of queues.
- nvme_tcp_offload: Add per device private_data.
- nvme_tcp_offload: Fix header digest, data digest and tos initialization.
Changes since RFC v5:
=====================
(Many thanks to Sagi Grimberg for his feedback)
- nvme-fabrics: Expose nvmf_check_required_opts() globally (as a new patch).
- nvme_tcp_offload: Remove io-queues BLK_MQ_F_BLOCKING.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_stop_queue (drain_queue) flow.
- nvme_tcp_offload: Fix the nvme_tcp_ofld_free_queue (destroy_queue) flow.
- nvme_tcp_offload: Change rwsem to mutex.
- nvme_tcp_offload: remove redundant fields.
- nvme_tcp_offload: Remove the "new" from setup_ctrl().
- nvme_tcp_offload: Remove the init_req() and commit_rqs() ops.
- nvme_tcp_offload: Minor fixes in nvme_tcp_ofld_create_ctrl() ansd
nvme_tcp_ofld_free_queue().
- nvme_tcp_offload: Patch 8 (timeout and async) was squeashed into
patch 7 (io level).
Changes since RFC v6:
=====================
- No changes in nvme_tcp_offload (only in qedn).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dean Balandin [Wed, 2 Jun 2021 18:42:46 +0000 (21:42 +0300)]
nvme-tcp-offload: Add IO level implementation
In this patch, we present the IO level functionality.
The nvme-tcp-offload shall work on the IO-level, meaning the
nvme-tcp-offload ULP module shall pass the request to the nvme-tcp-offload
vendor driver and shall expect for the request completion.
No additional handling is needed in between, this design will reduce the
CPU utilization as we will describe below.
The nvme-tcp-offload vendor driver shall register to nvme-tcp-offload ULP
with the following IO-path ops:
- send_req - in order to pass the request to the handling of the offload
driver that shall pass it to the vendor specific device
- poll_queue
The vendor driver will manage the context from which the request will be
executed and the request aggregations.
Once the IO completed, the nvme-tcp-offload vendor driver shall call
command.done() that shall invoke the nvme-tcp-offload ULP layer for
completing the request.
This patch also add support for the nvme-tcp-offload timeout and
nvme-tcp-offload ASYNC flow.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Dean Balandin <dbalandin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dean Balandin [Wed, 2 Jun 2021 18:42:45 +0000 (21:42 +0300)]
nvme-tcp-offload: Add queue level implementation
In this patch we implement queue level functionality.
The implementation is similar to the nvme-tcp module, the main
difference being that we call the vendor specific create_queue op which
creates the TCP connection, and NVMeTPC connection including
icreq+icresp negotiation.
Once create_queue returns successfully, we can move on to the fabrics
connect.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Dean Balandin <dbalandin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch, we implement controller level error handling and recovery.
Upon an error discovered by the ULP or reset controller initiated by the
nvme-core (using reset_ctrl workqueue), the ULP will initiate a controller
recovery which includes teardown and re-connect of all queues.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Arie Gershberg <agershberg@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In this patch we implement controller level functionality including:
- create_ctrl.
- delete_ctrl.
- free_ctrl.
The implementation is similar to other nvme fabrics modules, the main
difference being that the nvme-tcp-offload ULP calls the vendor specific
claim_dev() op with the given TCP/IP parameters to determine which device
will be used for this controller.
Once found, the vendor specific device and controller will be paired and
kept in a controller list managed by the ULP.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Arie Gershberg <agershberg@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dean Balandin [Wed, 2 Jun 2021 18:42:42 +0000 (21:42 +0300)]
nvme-tcp-offload: Add device scan implementation
As part of create_ctrl(), it scans the registered devices and calls
the claim_dev op on each of them, to find the first devices that matches
the connection params. Once the correct devices is found (claim_dev
returns true), we raise the refcnt of that device and return that device
as the device to be used for ctrl currently being created.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Dean Balandin <dbalandin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
nvme-fabrics: Move NVMF_ALLOWED_OPTS and NVMF_REQUIRED_OPTS definitions
Move NVMF_ALLOWED_OPTS and NVMF_REQUIRED_OPTS definitions
to header file, so it can be used by the different HW devices.
NVMeTCP offload devices might have different limitations of the
allowed options, for example, a device that does not support all the
queue types. With tcp and rdma, only the nvme-tcp and nvme-rdma layers
handle those attributes and the HW devices do not create any limitations
for the allowed options.
An alternative design could be to add separate fields in
nvme_tcp_ofld_ops such as max_hw_sectors and max_segments that
we already have in this series.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Arie Gershberg <agershberg@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Acked-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch will present the structure for the NVMeTCP offload common
layer driver. This module is added under "drivers/nvme/host/" and future
offload drivers which will register to it will be placed under
"drivers/nvme/hw".
This new driver will be enabled by the Kconfig "NVM Express over Fabrics
TCP offload commmon layer".
In order to support the new transport type, for host mode, no change is
needed.
Each new vendor-specific offload driver will register to this ULP during
its probe function, by filling out the nvme_tcp_ofld_dev->ops and
nvme_tcp_ofld_dev->private_data and calling nvme_tcp_ofld_register_dev
with the initialized struct.
The internal implementation:
- tcp-offload.h:
Includes all common structs and ops to be used and shared by offload
drivers.
- tcp-offload.c:
Includes the init function which registers as a NVMf transport just
like any other transport.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Dean Balandin <dbalandin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Wed, 2 Jun 2021 17:44:26 +0000 (13:44 -0400)]
tipc: simplify handling of lookup scope during multicast message reception
We introduce a new macro TIPC_ANY_SCOPE to make the handling of the
lookup scope value more comprehensible during multicast reception.
The (unchanged) rules go as follows:
1) Multicast messages sent from own node are delivered to all matching
sockets on the own node, irrespective of their binding scope.
2) Multicast messages sent from other nodes arrive here because they
have found TIPC_CLUSTER_SCOPE bindings emanating from this node.
Those messages should be delivered to exactly those sockets, but not
to local sockets bound with TIPC_NODE_SCOPE, since the latter
obviously were not meant to be visible for those senders.
3) Group multicast/broadcast messages are delivered to the sockets with
a binding scope matching exactly the lookup scope indicated in the
message header, and nobody else.
Reviewed-by: Xin Long <lucien.xin@gmail.com> Tested-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Wed, 2 Jun 2021 17:44:25 +0000 (13:44 -0400)]
tipc: refactor function tipc_sk_anc_data_recv()
We refactor tipc_sk_anc_data_recv() to make it slightly more
comprehensible, but also to facilitate application of some additions
to the code in a future commit.
Reviewed-by: Xin Long <lucien.xin@gmail.com> Tested-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Maloy [Wed, 2 Jun 2021 17:44:24 +0000 (13:44 -0400)]
tipc: eliminate redundant fields in struct tipc_sock
We eliminate the redundant fields conn_type and conn_instance in
struct tipc_sock. On the connecting side, this information is already
present in the unused (after the connection is established) part of
the pre-allocated header, and on the accepting side, we put it there
when the new socket is created.
Reviewed-by: Xin Long <lucien.xin@gmail.com> Tested-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Jun 2021 21:04:18 +0000 (14:04 -0700)]
Merge branch 'QED-NVMeTCP-Offload'
Shai Malin says:
====================
QED NVMeTCP Offload
Intro:
======
This is the qed part of Marvell’s NVMeTCP offload series, shared as
RFC series "NVMeTCP Offload ULP and QEDN Device Drive".
This part is a standalone series, and is not dependent on other parts
of the RFC.
The overall goal is to add qedn as the offload driver for NVMeTCP,
alongside the existing offload drivers (qedr, qedi and qedf for rdma,
iscsi and fcoe respectively).
In this series we are making the necessary changes to qed to enable this
by exposing APIs for FW/HW initializations.
The qedn series (and required changes to NVMe stack) will be sent to the
linux-nvme mailing list.
I have included more details on the upstream plan under section with the
same name below.
The Series Patches:
===================
1. qed: Add TCP_ULP FW resource layout – replacing iSCSI when common
with NVMeTCP.
2. qed: Add NVMeTCP Offload PF Level FW and HW HSI.
3. qed: Add NVMeTCP Offload Connection Level FW and HW HSI.
4. qed: Add support of HW filter block – enables redirecting NVMeTCP
traffic to the dedicated PF.
5. qed: Add NVMeTCP Offload IO Level FW and HW HSI.
6. qed: Add NVMeTCP Offload IO Level FW Initializations.
7. qed: Add IP services APIs support –VLAN, IP routing and reserving
TCP ports for the offload device.
The NVMeTCP Offload:
====================
With the goal of enabling a generic infrastructure that allows NVMe/TCP
offload devices like NICs to seamlessly plug into the NVMe-oF stack, this
patch series introduces the nvme-tcp-offload ULP host layer, which will
be a new transport type called "tcp-offload" and will serve as an
abstraction layer to work with vendor specific nvme-tcp offload drivers.
NVMeTCP offload is a full offload of the NVMeTCP protocol, this includes
both the TCP level and the NVMeTCP level.
The nvme-tcp-offload transport can co-exist with the existing tcp and
other transports. The tcp offload was designed so that stack changes are
kept to a bare minimum: only registering new transports.
All other APIs, ops etc. are identical to the regular tcp transport.
Representing the TCP offload as a new transport allows clear and manageable
differentiation between the connections which should use the offload path
and those that are not offloaded (even on the same device).
The nvme-tcp-offload layers and API compared to nvme-tcp and nvme-rdma:
* NVMe layer: *
[ nvme/nvme-fabrics/blk-mq ]
|
(nvme API and blk-mq API)
|
|
* Vendor agnostic transport layer: *
Performance:
============
With this implementation on top of the Marvell qedn driver (using the
Marvell FastLinQ NIC), we were able to demonstrate the following CPU
utilization improvement:
On AMD EPYC 7402, 2.80GHz, 28 cores:
- For 16K queued read IOs, 16jobs, 4qd (50Gbps line rate):
Improved the CPU utilization from 15.1% with NVMeTCP SW to 4.7% with
NVMeTCP offload.
On Intel(R) Xeon(R) Gold 5122 CPU, 3.60GHz, 16 cores:
- For 512K queued read IOs, 16jobs, 4qd (25Gbps line rate):
Improved the CPU utilization from 16.3% with NVMeTCP SW to 1.1% with
NVMeTCP offload.
In addition, we were able to demonstrate the following latency improvement:
- For 200K read IOPS (16 jobs, 16 qd, with fio rate limiter):
Improved the average latency from 105 usec with NVMeTCP SW to 39 usec
with NVMeTCP offload.
Improved the 99.99 tail latency from 570 usec with NVMeTCP SW to 91 usec
with NVMeTCP offload.
The end-to-end offload latency was measured from fio while running against
back end of null device.
The Marvell FastLinQ NIC HW engine:
====================================
The Marvell NIC HW engine is capable of offloading the entire TCP/IP
stack and managing up to 64K connections per PF, already implemented and
upstream use cases for this include iWARP (by the Marvell qedr driver)
and iSCSI (by the Marvell qedi driver).
In addition, the Marvell NIC HW engine offloads the NVMeTCP queue layer
and is able to manage the IO level also in case of TCP re-transmissions
and OOO events.
The HW engine enables direct data placement (including the data digest CRC
calculation and validation) and direct data transmission (including data
digest CRC calculation).
The Marvell qedn driver:
========================
The new driver will be added under "drivers/nvme/hw" and will be enabled
by the Kconfig "Marvell NVM Express over Fabrics TCP offload".
As part of the qedn init, the driver will register as a pci device driver
and will work with the Marvell fastlinQ NIC.
As part of the probe, the driver will register to the nvme_tcp_offload
(ULP) and to the qed module (qed_nvmetcp_ops) - similar to other
"qed_*_ops" which are used by the qede, qedr, qedf and qedi device
drivers.
Upstream Plan:
=============
The RFC series "NVMeTCP Offload ULP and QEDN Device Driver"
https://lore.kernel.org/netdev/20210531225222.16992-1-smalin@marvell.com/
was designed in a modular way so that part 1 (nvme-tcp-offload) and
part 2 (qed) are independent and part 3 (qedn) depends on both parts 1+2.
- Part 1 (RFC patch 1-8): NVMeTCP Offload ULP
The nvme-tcp-offload patches, will be sent to
'linux-nvme@lists.infradead.org'.
- Part 2 (RFC patches 9-15): QED NVMeTCP Offload
The qed infrastructure, will be sent to 'netdev@vger.kernel.org'.
Once part 1 and 2 are accepted:
- Part 3 (RFC patches 16-27): QEDN NVMeTCP Offload
The qedn patches, will be sent to 'linux-nvme@lists.infradead.org'.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Nikolay Assa [Wed, 2 Jun 2021 17:16:55 +0000 (20:16 +0300)]
qed: Add IP services APIs support
This patch introduces APIs which the NVMeTCP Offload device (qedn)
will use through the paired net-device (qede).
It includes APIs for:
- ipv4/ipv6 routing
- get VLAN from net-device
- TCP ports reservation
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Nikolay Assa <nassa@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch introduces the NVMeTCP FW initializations which is used
to initialize the IO level configuration into a per IO HW
resource ("task") as part of the IO path flow.
Shai Malin [Wed, 2 Jun 2021 17:16:53 +0000 (20:16 +0300)]
qed: Add NVMeTCP Offload IO Level FW and HW HSI
This patch introduces the NVMeTCP Offload FW and HW HSI in order
to initialize the IO level configuration into a per IO HW
resource ("task") as part of the IO path flow.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Shai Malin [Wed, 2 Jun 2021 17:16:51 +0000 (20:16 +0300)]
qed: Add NVMeTCP Offload Connection Level FW and HW HSI
This patch introduces the NVMeTCP HSI and HSI functionality in order to
initialize and interact with the HW device as part of the connection level
HSI.
This includes:
- Connection offload: offload a TCP connection to the FW.
- Connection update: update the ICReq-ICResp params
- Connection clear SQ: outstanding IOs FW flush.
- Connection termination: terminate the TCP connection and flush the FW.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Shai Malin [Wed, 2 Jun 2021 17:16:50 +0000 (20:16 +0300)]
qed: Add NVMeTCP Offload PF Level FW and HW HSI
This patch introduces the NVMeTCP device and PF level HSI and HSI
functionality in order to initialize and interact with the HW device.
The patch also adds qed NVMeTCP personality.
This patch is based on the qede, qedr, qedi, qedf drivers HSI.
Acked-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: Dean Balandin <dbalandin@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Omkar Kulkarni <okulkarni@marvell.com> Signed-off-by: Shai Malin <smalin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
In two places the 'ep_desc' and 'skb' local variables are used only
within if() or for() block, so they scope can be reduced which makes the
entire code slightly easier to follow. No functional change.
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
nfc: mrvl: remove useless "continue" at end of loop
The "continue" statement at the end of a for loop does not have an
effect. Entire loop contents can be slightly simplified to increase
code readability. No functional change.
Suggested-by: Joe Perches <joe@perches.com> Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Julian Wiedmann [Wed, 2 Jun 2021 08:56:26 +0000 (10:56 +0200)]
net/smc: no need to flush smcd_dev's event_wq before destroying it
destroy_workqueue() already calls drain_workqueue(), which is a stronger
variant of flush_workqueue().
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Karsten Graul [Wed, 2 Jun 2021 08:56:25 +0000 (10:56 +0200)]
net/smc: avoid possible duplicate dmb unregistration
smc_lgr_cleanup() calls smcd_unregister_all_dmbs() as part of the link
group termination process. This is a leftover from the times when
smc_lgr_cleanup() scheduled a worker to actually free the link group.
Nowadays smc_lgr_cleanup() directly calls smc_lgr_free() without any
delay so an earlier dmb unregistration is no longer needed.
So remove smcd_unregister_all_dmbs() and the call to it.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 3 Jun 2021 20:30:44 +0000 (13:30 -0700)]
Merge branch 'xpcs-phylink_pcs_ops'
Vladimir Oltean says:
====================
Convert xpcs to phylink_pcs_ops
Background: the sja1105 DSA driver currently drives a Designware XPCS
for SGMII and 2500base-X, and it would be nice to reuse some code with
the xpcs module. This would also help consolidate the phylink_pcs_ops,
since the only other dedicated PCS driver, currently, is the lynx_pcs.
Therefore, this series makes the xpcs expose the same kind of API that
the lynx_pcs module does. The main changes are getting rid of struct
mdio_xpcs_ops, being compatible with struct phylink_pcs_ops and being
less reliant on the phy_interface_t passed to xpcs_probe (now renamed to
xpcs_create).
This patch series is partially tested (some code paths have been covered
on the NXP SJA1105 and some others with the help of Vee Khee Wong on
Intel Tiger Lake / stmmac) but further testing on 10G setups would be
appreciated, if possible.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Jun 2021 16:20:19 +0000 (19:20 +0300)]
net: pcs: xpcs: convert to phylink_pcs_ops
Since all the remaining members of struct mdio_xpcs_ops have direct
equivalents in struct phylink_pcs_ops, it is about time we remove it
altogether.
Since the phylink ops return void, we need to remove the error
propagation from the various xpcs methods and simply print an error
message where appropriate.
Since xpcs_get_state_c73() detects link faults and attempts to reset the
link on its own by calling xpcs_config(), but xpcs_config() now has a
lot of phylink arguments which are not needed and cannot be simply
fabricated by anybody else except phylink, the actual implementation has
been moved into a smaller xpcs_do_config().
The const struct mdio_xpcs_ops *priv->hw->xpcs has been removed, so we
need to look at the struct mdio_xpcs_args pointer now as an indication
whether the port has an XPCS or not.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Jun 2021 16:20:18 +0000 (19:20 +0300)]
net: pcs: xpcs: convert to mdio_device
Unify the 2 existing PCS drivers (lynx and xpcs) by doing a similar
thing on probe, which is to have a *_create function that takes a
struct mdio_device * given by the caller, and builds a private PCS
structure around that.
This changes stmmac to hold only a pointer to the xpcs, as opposed to
the full structure. This will be used in the next patch when struct
mdio_xpcs_ops is removed. Currently a pointer to struct mdio_xpcs_ops
is used as a shorthand to determine whether the port has an XPCS or not.
We can do the same now with the mdio_xpcs_args pointer.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
priv->hw->xpcs is of the type "const struct mdio_xpcs_ops *" and is used
as a placeholder/synonym for priv->plat->mdio_bus_data->has_xpcs. It is
done that way because the mdio_bus_data pointer might or might not be
populated in all stmmac instantiations.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Jun 2021 16:20:14 +0000 (19:20 +0300)]
net: pcs: xpcs: export xpcs_validate
Calling a function pointer with a single implementation through
struct mdio_xpcs_ops is clunky, and the stmmac_do_callback system forces
this to return int, even though it always returns zero.
Simply remove the "validate" function pointer from struct mdio_xpcs_ops
and replace it with an exported xpcs_validate symbol which is called
directly by stmmac.
priv->hw->xpcs is of the type "const struct mdio_xpcs_ops *" and is used
as a placeholder/synonym for priv->plat->mdio_bus_data->has_xpcs. It is
done that way because the mdio_bus_data pointer might or might not be
populated in all stmmac instantiations.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Jun 2021 16:20:13 +0000 (19:20 +0300)]
net: pcs: xpcs: make the checks related to the PHY interface mode stateless
The operating mode of the driver is currently to populate its
struct mdio_xpcs_args::supported and struct mdio_xpcs_args::an_mode
statically in xpcs_probe(), based on the passed phy_interface_t,
and work with those.
However this is not the operation that phylink expects from a PCS
driver, because the port might be attached to an SFP cage that triggers
changes of the phy_interface_t dynamically as one SFP module is
unpluggged and another is plugged.
To migrate towards that model, the struct mdio_xpcs_args should not
cache anything related to the phy_interface_t, but just look up the
statically defined, const struct xpcs_compat structure corresponding to
the detected PCS OUI/model number.
So we delete the "supported" and "an_mode" members of struct
mdio_xpcs_args, and add the "id" structure there (since the ID is not
expected to change at runtime).
Since xpcs->supported is used deep in the code in _xpcs_config_aneg_c73(),
we need to modify some function headers to pass the xpcs_compat from all
callers. In turn, the xpcs_compat is always supplied externally to the
xpcs module:
- Most of the time by phylink
- In xpcs_probe() it is needed because xpcs_soft_reset() writes to
MDIO_MMD_PCS or to MDIO_MMD_VEND2 depending on whether an_mode is clause
37 or clause 73. In order to not introduce functional changes related
to when the soft reset is issued, we continue to require the initial
phy_interface_t argument to be passed to xpcs_probe() so we can pass
this on to xpcs_soft_reset().
- stmmac_open() wants to know whether to call stmmac_init_phy() or not,
and for that it looks inside xpcs->an_mode, because the clause 73
(backplane) AN modes supposedly do not have a PHY. Because we moved
an_mode outside of struct mdio_xpcs_args, this is now no longer
directly possible, so we introduce a helper function xpcs_get_an_mode()
which protects the data encapsulation of the xpcs module and requires
a phy_interface_t to be passed as argument. This function can look up
the appropriate compat based on the phy_interface_t.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 2 Jun 2021 16:20:12 +0000 (19:20 +0300)]
net: pcs: xpcs: there is only one PHY ID
The xpcs driver has an apparently inadequate structure for the actual
hardware it drives.
These defines and the xpcs_probe() function would suggest that there is
one PHY ID per supported PHY interface type, and the driver simply
validates whether the mode it should operate in (the argument of
xpcs_probe) matches what the hardware is capable of:
but that is not the case, because upon closer inspection, all the above
4 PHY ID definitions are in fact equal.
So it is the same XPCS that is compatible with all 4 sets of PHY
interface types.
This change introduces an array of struct xpcs_compat which is populated
by the single struct xpcs_id instance. It also eliminates the bogus
defines for multiple Synopsys XPCS PHY IDs and replaces them with a
single XPCS_ID, which better reflects the way in which the hardware
operates.
Because we are touching this area of the code anyway, the new array of
struct xpcs_compat, as well as the array of xpcs_id, have been moved
towards the end of the file, since they are variable declarations not
definitions. If whichever of struct xpcs_compat or struct xpcs_id need
to gain a function pointer member in the future, it is easier to
reference functions (no forward declarations needed) if we have the
const variable declarations at the end of the file.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Wed, 2 Jun 2021 11:01:16 +0000 (19:01 +0800)]
net: hdlc_cisco: remove redundant space
Space prohibited between function name and open parenthesis '('.
Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Wed, 2 Jun 2021 11:01:15 +0000 (19:01 +0800)]
net: hdlc_cisco: add blank line after declaration
This patch fixes the checkpatch error about missing a blank line
after declarations.
Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Wed, 2 Jun 2021 11:01:14 +0000 (19:01 +0800)]
net: hdlc_cisco: remove unnecessary out of memory message
This patch removes unnecessary out of memory message,
to fix the following checkpatch.pl warning:
"WARNING: Possible unnecessary 'out of memory' message"
Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Wed, 2 Jun 2021 11:01:13 +0000 (19:01 +0800)]
net: hdlc_cisco: add some required spaces
Add spaces required after the close parenthesis '}'.
Add spaces required after that ','.
Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Wed, 2 Jun 2021 11:01:12 +0000 (19:01 +0800)]
net: hdlc_cisco: fix the code style issue about "foo* bar"
Fix the checkpatch error as "foo* bar" and should be "foo *bar",
and "(foo*)" should be "(foo *)".
Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peng Li [Wed, 2 Jun 2021 11:01:11 +0000 (19:01 +0800)]
net: hdlc_cisco: remove redundant blank lines
This patch removes some redundant blank lines.
Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Jun 2021 21:08:37 +0000 (14:08 -0700)]
Merge branch 'devlink-rate-objects'
Dmytro Linkin says:
====================
devlink: rate objects API
Resending without RFC.
Currently kernel provides a way to change tx rate of single VF in
switchdev mode via tc-police action. When lots of VFs are configured
management of theirs rates becomes non-trivial task and some grouping
mechanism is required. Implementing such grouping in tc-police will bring
flow related limitations and unwanted complications, like:
- tc-police is a policer and there is a user request for a traffic
shaper, so shared tc-police action is not suitable;
- flows requires net device to be placed on, means "groups" wouldn't
have net device instance itself. Taking into the account previous
point was reviewed a sollution, when representor have a policer and
the driver use a shaper if qdisc contains group of VFs - such approach
ugly, compilated and misleading;
- TC is ingress only, while configuring "other" side of the wire looks
more like a "real" picture where shaping is outside of the steering
world, similar to "ip link" command;
According to that devlink is the most appropriate place.
This series introduces devlink API for managing tx rate of single devlink
port or of a group by invoking callbacks (see below) of corresponding
driver. Also devlink port or a group can be added to the parent group,
where driver responsible to handle rates of a group elements. To achieve
all of that new rate object is added. It can be one of the two types:
- leaf - represents a single devlink port; created/destroyed by the
driver and bound to the devlink port. As example, some driver may
create leaf rate object for every devlink port associated with VF.
Since leaf have 1to1 mapping to it's devlink port, in user space it is
referred as pci/<bus_addr>/<port_index>;
- node - represents a group of rate objects; created/deleted by request
from the userspace; initially empty (no rate objects added). In
userspace it is referred as pci/<bus_addr>/<node_name>, where node name
can be any, except decimal number, to avoid collisions with leafs.
devlink_ops extended with following callbacks:
- rate_{leaf|node}_tx_{share|max}_set
- rate_node_{new|del}
- rate_{leaf|node}_parent_set
KAPI provides:
- creation/destruction of the leaf rate object associated with devlink
port
- destruction of rate nodes to allow a vendor driver to free allocated
resources on driver removal or due to the other reasons when nodes
destruction required
UAPI provides:
- dumping all or single rate objects
- setting tx_{share|max} of rate object of any type
- creating/deleting node rate object
- setting/unsetting parent of any rate object
Added devlink rate object support for netdevsim driver
Issues/open questions:
- Does user need DEVLINK_CMD_RATE_DEL_ALL_CHILD command to clean all
children of particular parent node? For example:
$ devlink port function rate flush netdevsim/netdevsim10/group
- priv pointer passed to the callbacks is a source of bugs; in leaf case
driver can embed rate object into internal structure and use
container_of() on it; in node case it cannot be done since nodes are
created from userspace
v1->v2:
- fixed kernel-doc for devlink_rate_leaf_{create|destroy}()
- s/func/function/ for all devlink port command occurences
v2->v3:
- devlink:
- added devlink_rate_nodes_destroy() function
- netdevsim:
- added call of devlink_rate_nodes_destroy() function
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:29 +0000 (15:17 +0300)]
netdevsim: Allow setting parent node of rate objects
Implement new devlink ops that allow setting rate node as a parent for
devlink port (leaf) or another devlink node through devlink API.
Expose parent names to netdevsim debugfs in read only mode.
Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:28 +0000 (15:17 +0300)]
devlink: Allow setting parent node of rate objects
Refactor DEVLINK_CMD_RATE_{GET|SET} command handlers to support setting
a node as a parent for another rate object (leaf or node) by means of
new attribute DEVLINK_ATTR_RATE_PARENT_NODE_NAME. Extend devlink ops
with new callbacks rate_{leaf|node}_parent_set() to set node as a parent
for rate object to allow supporting drivers to implement rate grouping
through devlink. Driver implementations are allowed to support leafs
or node children only. Invoking callback with NULL as parent should be
threated by the driver as unset parent action.
Extend rate object struct with reference counter to disallow deleting a
node with any child pointing to it. User should unset parent for the
child explicitly.
Example:
$ devlink port function rate add netdevsim/netdevsim10/group1
$ devlink port function rate add netdevsim/netdevsim10/group2
$ devlink port function rate set netdevsim/netdevsim10/group1 parent group2
$ devlink port function rate show netdevsim/netdevsim10/group1
netdevsim/netdevsim10/group1: type node parent group2
$ devlink port function rate set netdevsim/netdevsim10/group1 noparent
Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:26 +0000 (15:17 +0300)]
netdevsim: Implement support for devlink rate nodes
Implement new devlink ops that allow creation, deletion and setting of
shared/max tx rate of devlink rate nodes through devlink API.
Expose rate node and it's tx rates to netdevsim debugfs.
Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:25 +0000 (15:17 +0300)]
devlink: Introduce rate nodes
Implement support for DEVLINK_CMD_RATE_{NEW|DEL} commands that are used
to create and delete devlink rate nodes. Add new attribute
DEVLINK_ATTR_RATE_NODE_NAME that specify node name string. The node name
is an alphanumeric identifier. No valid node name can be a devlink port
index, eg. decimal number. Extend devlink ops with new callbacks
rate_node_{new|del}() and rate_node_tx_{share|max}_set() to allow
supporting drivers to implement ports rate grouping and setting tx rate
of rate nodes through devlink.
Expose devlink_rate_nodes_destroy() function to allow vendor driver do
proper cleanup of internally allocated resources for the nodes if the
driver goes down or due to any other reasons which requires nodes to be
destroyed.
Disallow moving device from switchdev to legacy mode if any node exists
on that device. User must explicitly delete nodes before switching mode.
Example:
$ devlink port function rate add netdevsim/netdevsim10/group1
$ devlink port function rate set netdevsim/netdevsim10/group1 \
tx_share 10mbit tx_max 100mbit
Add + set command can be combined:
$ devlink port function rate add netdevsim/netdevsim10/group1 \
tx_share 10mbit tx_max 100mbit
$ devlink port function rate show netdevsim/netdevsim10/group1
netdevsim/netdevsim10/group1: type node tx_share 10mbit tx_max 100mbit
$ devlink port function rate del netdevsim/netdevsim10/group1
Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:22 +0000 (15:17 +0300)]
devlink: Allow setting tx rate for devlink rate leaf objects
Implement support for DEVLINK_CMD_RATE_SET command with new attributes
DEVLINK_ATTR_RATE_TX_{SHARE|MAX} that are used to set devlink rate
shared/max tx rate values. Extend devlink ops with new callbacks
rate_leaf_tx_{share|max}_set() to allow supporting drivers to implement
rate control through devlink.
New attributes are optional. Driver implementations are allowed to
support either or both of them.
Shared rate example:
$ devlink port function rate set netdevsim/netdevsim10/0 tx_share 10mbit
$ devlink port function rate show netdevsim/netdevsim10/0
netdevsim/netdevsim10/0: type leaf tx_share 10mbit
Max rate example:
$ devlink port function rate set netdevsim/netdevsim10/0 tx_max 100mbit
$ devlink port function rate show netdevsim/netdevsim10/0
netdevsim/netdevsim10/0: type leaf tx_max 100mbit
Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:19 +0000 (15:17 +0300)]
devlink: Introduce rate object
Allow registering rate object for devlink ports with dedicated
devlink_rate_leaf_{create|destroy}() API. Implement new netlink
DEVLINK_CMD_RATE_GET command that is used to retrieve rate object info.
Add new DEVLINK_CMD_RATE_{NEW|DEL} commands that are used for
notifications when creating/deleting leaf rate object.
Rate API is intended to be used for rate limiting of individual
devlink ports (leafs) and their aggregates (nodes).
Example:
$ devlink port show
pci/0000:03:00.0/0
pci/0000:03:00.0/1
$ devlink port function rate show
pci/0000:03:00.0/0: type leaf
pci/0000:03:00.0/1: type leaf
Co-developed-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:18 +0000 (15:17 +0300)]
netdevsim: Implement legacy/switchdev mode for VFs
Implement callbacks to set/get eswitch mode value. Add helpers to check
current mode.
Instantiate VFs' net devices and devlink ports on switchdev enabling and
remove them on legacy enabling. Changing number of VFs while in
switchdev mode triggers VFs creation/deletion.
Also disable NDO API callback to set VF rate, since it's legacy API.
Switchdev API to set VF rate will be implemented in one of the next
patches.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:17 +0000 (15:17 +0300)]
netdevsim: Implement VFs
Allow creation of netdevsim ports for VFs along with allocations of
corresponding net devices and devlink ports.
Add enums and helpers to distinguish PFs' ports from VFs' ports.
Ports creation/deletion debugfs API intended to be used with physical
ports only.
VFs instantiation will be done in one of the next patches.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:16 +0000 (15:17 +0300)]
netdevsim: Implement port types and indexing
Define type of ports, which netdevsim driver currently operates with as
PF. Define new port type - VF, which will be implemented in following
patches. Add helper functions to distinguish them. Add helper function
to get VF index from port index.
Add port indexing logic where PFs' indexes starts from 0, VFs' - from
NSIM_DEV_VF_PORT_INDEX_BASE.
All ports uses same index pool, which means that PF port may be created
with index from VFs' indexes range.
Maximum number of VFs, which the driver can allocate, is limited by
UINT_MAX - BASE.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:15 +0000 (15:17 +0300)]
netdevsim: Disable VFs on nsim_dev_reload_destroy() call
Move VFs disabling from device release() to nsim_dev_reload_destroy() to
make VFs disabling and ports removal simultaneous.
This is a requirement for VFs ports implemented in next patches.
Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmytro Linkin [Wed, 2 Jun 2021 12:17:14 +0000 (15:17 +0300)]
netdevsim: Add max_vfs to bus_dev
Currently there is no limit to the number of VFs netdevsim can enable.
In a real systems this value exist and used by the driver.
Fore example, some features might need to consider this value when
allocating memory.
Expose max_vfs variable to debugfs as configurable resource. If are VFs
configured (num_vfs != 0) then changing of max_vfs not allowed.
Co-developed-by: Yuval Avnery <yuvalav@nvidia.com> Signed-off-by: Yuval Avnery <yuvalav@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Jun 2021 21:04:42 +0000 (14:04 -0700)]
Merge branch 'nfp-ct-offload'
Simon Horman says:
====================
Introduce conntrack offloading to the nfp driver
Louis Peens says:
This is the first in a series of patches to offload conntrack
to the nfp. The approach followed is to flatten out three
different flow rules into a single offloaded flow. The three
different flows are:
1) The rule sending the packet to conntrack (pre_ct)
2) The rule matching on +trk+est after a packet has been through
conntrack. (post_ct)
3) The rule received via callback from the netfilter (nft)
In order to offload a flow we need a combination of all three flows, but
they could be added/deleted at different times and in different order.
To solve this we save potential offloadable CT flows in the driver,
and every time we receive a callback we check against these saved flows
for valid merges. Once we have a valid combination of all three flows
this will be offloaded to the NFP. This is demonstrated in the diagram
below.
+-------------+ +----------+
| pre_ct flow +--------+ | nft flow |
+-------------+ v +------+---+
+----------+ |
| tc_merge +--------+ |
+----------+ v v
+--------------+ ^ +-------------+
| post_ct flow +-------+ +---+nft_tc merge |
+--------------+ | +-------------+
|
|
|
v
Offload to nfp
This series is only up to the point of the pre_ct and post_ct
merges into the tc_merge. Follow up series will continue
to add the nft flows and merging of these flows with the result
of the pre_ct and post_ct merged flows.
Changes since v2:
- nfp: flower-ct: add zone table entry when handling pre/post_ct flows
Fixed another docstring. Should finally have the patch check
environment properly configured now to avoid more of these.
- nfp: flower-ct: add tc merge functionality
Fixed warning found by "kernel test robot <lkp@intel.com>"
Added code comment explaining chain_index comparison
Changes since v1:
- nfp: flower-ct: add ct zone table
Fixed unused variable compile warning
Fixed missing colon in struct description
====================
Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:52 +0000 (13:59 +0200)]
nfp: flower-ct: add tc merge functionality
Add merging of pre/post_ct flow rules into the tc_merge table.
Pre_ct flows needs to be merge with post_ct flows and vice versa.
This needs to be done for all flows in the same zone table, as well
as with the wc_zone_table, which is for flows masking out ct_zone
info.
Cleanup is happening when all the tables are cleared up and prints
a warning traceback as this is not expected in the final version.
At this point we are not actually returning success for the offload,
so we do not get any delete requests for flows, so we can't delete
them that way yet. This means that cleanup happens in what would
usually be an exception path.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:51 +0000 (13:59 +0200)]
nfp: flower-ct: add tc_merge_tb
Add the table required to store the merge result of pre_ct and post_ct
flows. This is just the initial setup and teardown of the table,
the implementation will be in follow-up patches.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:50 +0000 (13:59 +0200)]
nfp: flower-ct: add a table to map flow cookies to ct flows
Add a hashtable which contains entries to map flow cookies to ct
flow entries. Currently the entries are added and not used, but
follow-up patches will use this for stats updates and flow deletes.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:49 +0000 (13:59 +0200)]
nfp: flower-ct: add nfp_fl_ct_flow_entries
This commit starts adding the structures and lists that will
be used in follow up commits to enable offloading of conntrack.
Some stub functions are also introduced as placeholders by
this commit.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:48 +0000 (13:59 +0200)]
nfp: flower-ct: add zone table entry when handling pre/post_ct flows
Start populating the pre/post_ct handler functions. Add a zone entry
to the zone table, based on the zone information from the flow. In
the case of a post_ct flow which has a wildcarded match on the zone
create a special entry.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:47 +0000 (13:59 +0200)]
nfp: flower-ct: add ct zone table
Add initial zone table to nfp_flower_priv. This table will be used
to store all the information required to offload conntrack.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:46 +0000 (13:59 +0200)]
nfp: flower-ct: add pre and post ct checks
Add checks to see if a flow is a conntrack flow we can potentially
handle. Just stub out the handling the different conntrack flows.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Louis Peens [Wed, 2 Jun 2021 11:59:45 +0000 (13:59 +0200)]
nfp: flower: move non-zero chain check
This is in preparation for conntrack offload support which makes
used of different chains. Add explicit checks for conntrack and
non-zero chains in the add_offload path.
Signed-off-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yinjun Zhang <yinjun.zhang@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patch 0 documents the MAPv4/v5.
Patch 1 introduces the MAPv5 and the Inline checksum offload for RX/Ingress.
Patch 2 introduces the MAPv5 and the Inline checksum offload for TX/Egress.
A new checksum header format is used as part of MAPv5.For RX checksum offload,
the checksum is verified by the HW and the validity is marked in the checksum
header of MAPv5. For TX, the required metadata is filled up so hardware can
compute the checksum.
v1->v2:
- Fixed the compilation errors, warnings reported by kernel test robot.
- Checksum header definition is expanded to support big, little endian
formats as mentioned by Jakub.
v2->v3:
- Fixed compilation errors reported by kernel bot for big endian flavor.
v3->v4:
- Made changes to use masks instead of C bit-fields as suggested by Jakub/Alex.
v4->v5:
- Corrected checkpatch errors and warnings reported by patchwork.
v5->v6:
- Corrected the bug identified by Alex and incorporated all his comments.
v6->v7:
- Removed duplicate inclusion of linux/bitfield.h in rmnet_map_data.c
v7->v8:
- Have addressed comments given by JAkub on v7 patches.
- As suggested by Jakub, skb_cow_head() is used instead of expanding
the head directly. This is now done in rmnet_map_egress_handler().
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: ethernet: rmnet: Support for ingress MAPv5 checksum offload
Adding support for processing of MAPv5 downlink packets.
It involves parsing the Mapv5 packet and checking the csum header
to know whether the hardware has validated the checksum and is
valid or not.
Based on the checksum valid bit the corresponding stats are
incremented and skb->ip_summed is marked either CHECKSUM_UNNECESSARY
or left as CHEKSUM_NONE to let network stack revalidate the checksum
and update the respective snmp stats.
Current MAPV1 header has been modified, the reserved field in the
Mapv1 header is now used for next header indication.
Signed-off-by: Sharath Chandra Vurukala <sharathv@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 2 Jun 2021 00:07:56 +0000 (17:07 -0700)]
Merge branch 'iwl-next' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/linux
Tony Nguyen says:
====================
iwl-next Intel Wired LAN Driver Updates 2021-06-01
This pull request is targeting net-next and rdma-next branches.
These patches have been reviewed by netdev and rdma mailing lists[1].
This series adds RDMA support to the ice driver for E810 devices and
converts the i40e driver to use the auxiliary bus infrastructure
for X722 devices. The PCI netdev drivers register auxiliary RDMA devices
that will bind to auxiliary drivers registered by the new irdma module.
[1] https://lore.kernel.org/netdev/20210520143809.819-1-shiraz.saleem@intel.com/
---
v3:
- ice_aq_add_rdma_qsets(), ice_cfg_vsi_rdma(), ice_[ena|dis]_vsi_rdma_qset(),
and ice_cfg_rdma_fltr() no longer return ice_status
- Remove null check from ice_aq_add_rdma_qsets()
Changes since linked review (v6):
- Removed unnecessary checks in i40e_client_device_register() and
i40e_client_device_unregister()
- Simplified the i40e_client_device_register() API
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Zheng Yongjun [Tue, 1 Jun 2021 14:16:35 +0000 (22:16 +0800)]
vrf: Fix a typo
possibile ==> possible
Signed-off-by: Zheng Yongjun <zhengyongjun3@huawei.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 1 Jun 2021 14:02:38 +0000 (22:02 +0800)]
igb: Fix -Wunused-const-variable warning
If CONFIG_IGB_HWMON is n, gcc warns:
drivers/net/ethernet/intel/igb/e1000_82575.c:2765:17:
warning: ‘e1000_emc_therm_limit’ defined but not used [-Wunused-const-variable=]
static const u8 e1000_emc_therm_limit[4] = {
^~~~~~~~~~~~~~~~~~~~~
drivers/net/ethernet/intel/igb/e1000_82575.c:2759:17:
warning: ‘e1000_emc_temp_data’ defined but not used [-Wunused-const-variable=]
static const u8 e1000_emc_temp_data[4] = {
^~~~~~~~~~~~~~~~~~~
Move it into #ifdef block to fix this.
Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
YueHaibing [Tue, 1 Jun 2021 14:01:48 +0000 (22:01 +0800)]
cxgb4: Fix -Wunused-const-variable warning
If CONFIG_PCI_IOV is n, make W=1 warns:
drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c:3909:33:
warning: ‘cxgb4_mgmt_ethtool_ops’ defined but not used [-Wunused-const-variable=]
static const struct ethtool_ops cxgb4_mgmt_ethtool_ops = {
^~~~~~~~~~~~~~~~~~~~~~
Move it into #ifdef block to fix this.
Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
drivers/net/hamradio/bpqether.c:437:36:
warning: ‘bpq_seqops’ defined but not used [-Wunused-const-variable=]
static const struct seq_operations bpq_seqops = {
^~~~~~~~~~
Use #ifdef macro to gurad this.
Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>