Boris Pismenny [Wed, 18 May 2022 09:27:31 +0000 (12:27 +0300)]
tls: Add opt-in zerocopy mode of sendfile()
TLS device offload copies sendfile data to a bounce buffer before
transmitting. It allows to maintain the valid MAC on TLS records when
the file contents change and a part of TLS record has to be
retransmitted on TCP level.
In many common use cases (like serving static files over HTTPS) the file
contents are not changed on the fly. In many use cases breaking the
connection is totally acceptable if the file is changed during
transmission, because it would be received corrupted in any case.
This commit allows to optimize performance for such use cases to
providing a new optional mode of TLS sendfile(), in which the extra copy
is skipped. Removing this copy improves performance significantly, as
TLS and TCP sendfile perform the same operations, and the only overhead
is TLS header/trailer insertion.
The new mode can only be enabled with the new socket option named
TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards
compatibility with existing applications that rely on the copying
behavior.
The new mode is safe, meaning that unsolicited modifications of the file
being sent can't break integrity of the kernel. The worst thing that can
happen is sending a corrupted TLS record, which is in any case not
forbidden when using regular TCP sockets.
Sockets other than TLS device offload are not affected by the new socket
option. The actual status of zerocopy sendfile can be queried with
sock_diag.
Performance numbers in a single-core test with 24 HTTPS streams on
nginx, under 100% CPU load:
Andrew Lunn [Wed, 18 May 2022 00:58:40 +0000 (02:58 +0200)]
net: bridge: Clear offload_fwd_mark when passing frame up bridge interface.
It is possible to stack bridges on top of each other. Consider the
following which makes use of an Ethernet switch:
br1
/ \
/ \
/ \
br0.11 wlan0
|
br0
/ | \
p1 p2 p3
br0 is offloaded to the switch. Above br0 is a vlan interface, for
vlan 11. This vlan interface is then a slave of br1. br1 also has a
wireless interface as a slave. This setup trunks wireless lan traffic
over the copper network inside a VLAN.
A frame received on p1 which is passed up to the bridge has the
skb->offload_fwd_mark flag set to true, indicating that the switch has
dealt with forwarding the frame out ports p2 and p3 as needed. This
flag instructs the software bridge it does not need to pass the frame
back down again. However, the flag is not getting reset when the frame
is passed upwards. As a result br1 sees the flag, wrongly interprets
it, and fails to forward the frame to wlan0.
When passing a frame upwards, clear the flag. This is the Rx
equivalent of br_switchdev_frame_unmark() in br_dev_xmit().
Fixes: f1c2eddf4cb6 ("bridge: switchdev: Use an helper to clear forward mark") Signed-off-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220518005840.771575-1-andrew@lunn.ch Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jonathan Lemon [Tue, 17 May 2022 21:46:00 +0000 (14:46 -0700)]
ptp: ocp: change sysfs attr group handling
In the detach path, the driver calls sysfs_remove_group() for the
groups it believes has been registered. However, if the group was
never previously registered, then this causes a splat.
Instead, compute the groups that should be registered in advance,
and then call sysfs_create_groups(), which registers them all at once.
Update the error handling appropriately.
Fixes: c205d53c4923 ("ptp: ocp: Add firmware capability bits for feature gating") Reported-by: Zheyu Ma <zheyuma97@gmail.com> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Link: https://lore.kernel.org/r/20220517214600.10606-1-jonathan.lemon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Saeed Mahameed [Wed, 18 May 2022 06:58:20 +0000 (23:58 -0700)]
sfc: siena: Have a unique wrapper ifndef for efx channels header
Both sfc/efx_channels.h and sfc/siena/efx_channels.h used the same
wrapper #ifndef EFX_CHANNELS_H, this patch changes the siena define to be
EFX_SIENA_CHANNELS_H to avoid build system confusion.
This fixes the following build break:
drivers/net/ethernet/sfc/ptp.c:2191:28:
error: ‘efx_copy_channel’ undeclared here (not in a function); did you mean ‘efx_ptp_channel’?
2191 | .copy = efx_copy_channel,
Fixes: 6e173d3b4af9 ("sfc: Copy shared files needed for Siena (part 1)") Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Cc: Edward Cree <ecree.xilinx@gmail.com> Acked-by: Martin Habets <habetsm.xilinx@gmail.com> Link: https://lore.kernel.org/r/20220518065820.131611-1-saeed@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
octeon_ep: Fix irq releasing in the error handling path of octep_request_irqs()
When taken, the error handling path does not undo correctly what has
already been allocated.
Introduce a new loop index, 'j', in order to simplify the error handling
path and rewrite part of it.
It is now written with the same logic and intermediate variables used
when resources are allocated. This is much more straightforward.
Fixes: 37d79d059606 ("octeon_ep: add Tx/Rx processing and interrupt support") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 19 May 2022 02:58:36 +0000 (19:58 -0700)]
Merge branch 'adin-add-support-for-clock-output'
Josua Mayer says:
====================
adin: add support for clock output
This patch series adds support for configuring the two clock outputs of adin
1200 and 1300 PHYs. Certain network controllers require an external reference
clock which can be provided by the PHY.
One of the replies to v1 was asking why the common clock framework isn't used.
Currently no PHY driver has implemented providing a clock to the network
controller. Instead they rely on vendor extensions to make the appropriate
configuration. For example ar8035 uses qca,clk-out-frequency - this patchset
aimed to replicate the same functionality.
Finally the 125MHz free-running clock is enabled in the device-tree for
SolidRun i.MX6 SoMs, to support revisions 1.9 and later, where the original phy
has been replaced with an adin 1300.
To avoid introducing new warning messages during boot for SoMs before rev 1.9,
the status field of the new phy node is disabled by default, and will be
enabled by U-Boot on demand.
====================
Josua Mayer [Tue, 17 May 2022 08:54:31 +0000 (11:54 +0300)]
ARM: dts: imx6qdl-sr-som: update phy configuration for som revision 1.9
Since SoM revision 1.9 the PHY has been replaced with an ADIN1300,
add an entry for it next to the original.
As Russell King pointed out, additional phy nodes cause warnings like:
mdio_bus 2188000.ethernet-1: MDIO device at address 1 is missing
To avoid this the new node has its status set to disabled. U-Boot will
be modified to enable the appropriate phy node after probing.
The existing ar8035 nodes have to stay enabled by default to avoid
breaking existing systems when they update Linux only.
Josua Mayer [Tue, 17 May 2022 08:54:30 +0000 (11:54 +0300)]
net: phy: adin: add support for clock output
The ADIN1300 supports generating certain clocks on its GP_CLK pin, as
well as providing the reference clock on CLK25_REF.
Add support for selecting the clock via device-tree properties.
Technically the phy also supports a recovered 125MHz clock for
synchronous ethernet. SyncE should be configured dynamically at
runtime, however Linux does not currently have a toggle for this,
so support is explicitly omitted.
The ADIN1300 supports generating certain clocks on its GP_CLK pin, as
well as providing the reference clock on CLK25_REF.
Add DT properties to configure both pins.
Technically the phy also supports a recovered 125MHz clock for
synchronous ethernet. However SyncE should be configured dynamically at
runtime, so it is explicitly omitted in this binding.
Signed-off-by: Josua Mayer <josua@solid-run.com> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) Reduce number of hardware offload retries from flowtable datapath
which might hog system with retries, from Felix Fietkau.
2) Skip neighbour lookup for PPPoE device, fill_forward_path() already
provides this and set on destination address from fill_forward_path for
PPPoE device, also from Felix.
4) When combining PPPoE on top of a VLAN device, set info->outdev to the
PPPoE device so software offload works, from Felix.
5) Fix TCP teardown flowtable state, races with conntrack gc might result
in resetting the state to ESTABLISHED and the time to one day. Joint
work with Oz Shlomo and Sven Auhagen.
6) Call dst_check() from flowtable datapath to check if dst is stale
instead of doing it from garbage collector path.
7) Disable register tracking infrastructure, either user-space or
kernel need to pre-fetch keys inconditionally, otherwise register
tracking assumes data is already available in register that might
not well be there, leading to incorrect reductions.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: disable expression reduction infra
netfilter: flowtable: move dst_check to packet path
netfilter: flowtable: fix TCP flow teardown
netfilter: nft_flow_offload: fix offload with pppoe + vlan
net: fix dev_fill_forward_path with pppoe + bridge
netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices
netfilter: flowtable: fix excessive hw offload attempts after failure
====================
Linus Torvalds [Thu, 19 May 2022 00:21:30 +0000 (14:21 -1000)]
Merge tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Two small changes fixing issues from the 5.18 merge window:
- Fix wrong ordering of a tracepoint (Dylan)
- Fix MSG_RING on IOPOLL rings (me)"
* tag 'io_uring-5.18-2022-05-18' of git://git.kernel.dk/linux-block:
io_uring: don't attempt to IOPOLL for MSG_RING requests
io_uring: fix ordering of args in io_uring_queue_async_work
Linus Torvalds [Thu, 19 May 2022 00:19:26 +0000 (14:19 -1000)]
Merge tag 'audit-pr-20220518' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit
Pull audit fix from Paul Moore:
"A single audit patch to fix a problem where a task's audit_context was
not being properly reset with io_uring"
* tag 'audit-pr-20220518' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
audit,io_uring,io-wq: call __audit_uring_exit for dummy contexts
Linus Torvalds [Thu, 19 May 2022 00:07:43 +0000 (14:07 -1000)]
Merge branch 'arm/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"The SoC bug fixes have calmed down sufficiently, there is one minor
update for the MAINTAINERS file, and few bug fixes for dts
descriptions:
- Updates to the BananaPi R2-Pro (rk3568) dts to match production
hardware rather than the prototype version.
- Qualcomm sm8250 soundwire gets disabled on some machines to avoid
crashes
- A number of aspeed SoC specific fixes, addressing incorrect pin
cotrol settings, some values in the romed8hm board, and a revert
for an accidental removal of a DT node"
* 'arm/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
MAINTAINERS: omap: remove me as a maintainer
ARM: dts: aspeed: Add video engine to g6
ARM: dts: aspeed: romed8hm3: Fix GPIOB0 name
ARM: dts: aspeed: romed8hm3: Add lm25066 sense resistor values
ARM: dts: aspeed-g6: fix SPI1/SPI2 quad pin group
ARM: dts: aspeed-g6: add FWQSPI group in pinctrl dtsi
dt-bindings: pinctrl: aspeed-g6: add FWQSPI function/group
pinctrl: pinctrl-aspeed-g6: add FWQSPI function-group
dt-bindings: pinctrl: aspeed-g6: remove FWQSPID group
pinctrl: pinctrl-aspeed-g6: remove FWQSPID group in pinctrl
ARM: dts: aspeed-g6: remove FWQSPID group in pinctrl dtsi
arm64: dts: qcom: sm8250: don't enable rx/tx macro by default
arm64: dts: rockchip: Add gmac1 and change network settings of bpi-r2-pro
arm64: dts: rockchip: Change io-domains of bpi-r2-pro
Linus Torvalds [Thu, 19 May 2022 00:02:25 +0000 (14:02 -1000)]
Merge branch 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc fixes from Al Viro:
"vhost race fix and a percpu_ref_init-caused cgroup double-free fix.
The latter had manifested as buggered struct mount refcounting - those
are also using percpu data structures, but anything that does percpu
allocations could be hit"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
Fix double fget() in vhost_net_set_backend()
percpu_ref_init(): clean ->percpu_count_ref on failure
Al Viro [Mon, 16 May 2022 08:42:13 +0000 (16:42 +0800)]
Fix double fget() in vhost_net_set_backend()
Descriptor table is a shared resource; two fget() on the same descriptor
may return different struct file references. get_tap_ptr_ring() is
called after we'd found (and pinned) the socket we'll be using and it
tries to find the private tun/tap data structures associated with it.
Redoing the lookup by the same file descriptor we'd used to get the
socket is racy - we need to same struct file.
Thanks to Jason for spotting a braino in the original variant of patch -
I'd missed the use of fd == -1 for disabling backend, and in that case
we can end up with sock == NULL and sock != oldsock.
Cc: stable@kernel.org Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Eli Cohen [Mon, 16 May 2022 08:47:35 +0000 (11:47 +0300)]
vdpa/mlx5: Use consistent RQT size
The current code evaluates RQT size based on the configured number of
virtqueues. This can raise an issue in the following scenario:
Assume MQ was negotiated.
1. mlx5_vdpa_set_map() gets called.
2. handle_ctrl_mq() is called setting cur_num_vqs to some value, lower
than the configured max VQs.
3. A second set_map gets called, but now a smaller number of VQs is used
to evaluate the size of the RQT.
4. handle_ctrl_mq() is called with a value larger than what the RQT can
hold. This will emit errors and the driver state is compromised.
To fix this, we use a new field in struct mlx5_vdpa_net to hold the
required number of entries in the RQT. This value is evaluated in
mlx5_vdpa_set_driver_features() where we have the negotiated features
all set up.
In addition to that, we take into consideration the max capability of RQT
entries early when the device is added so we don't need to take consider
it when creating the RQT.
Last, we remove the use of mlx5_vdpa_max_qps() which just returns the
max_vas / 2 and make the code clearer.
Fixes: 52893733f2c5 ("vdpa/mlx5: Add multiqueue support") Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Eli Cohen <elic@nvidia.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Linus Torvalds [Wed, 18 May 2022 15:53:43 +0000 (05:53 -1000)]
Merge tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of last-minute HD- an USB-audio quirks in addition to a
fix for the legacy ISA wavefront driver.
All look small and easy"
* tag 'sound-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: usb-audio: Restore Rane SL-1 quirk
ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machine
ALSA: hda/realtek: Add quirk for TongFang devices with pop noise
ALSA: hda/realtek: Add quirk for the Framework Laptop
ALSA: wavefront: Proper check of get_user() error
ALSA: hda/realtek: Add quirk for Dell Latitude 7520
ALSA: hda - fix unused Realtek function when PM is not enabled
ALSA: usb-audio: Don't get sample rate for MCT Trigger 5 USB-to-HDMI
netfilter: nf_tables: disable expression reduction infra
Either userspace or kernelspace need to pre-fetch keys inconditionally
before comparisons for this to work. Otherwise, register tracking data
is misleading and it might result in reducing expressions which are not
yet registers.
First expression is also guaranteed to be evaluated always, however,
certain expressions break before writing data to registers, before
comparing the data, leaving the register in undetermined state.
This patch disables this infrastructure by now.
Fixes: b2d306542ff9 ("netfilter: nf_tables: do not reduce read-only expressions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ritaro Takenaka [Tue, 17 May 2022 10:55:30 +0000 (12:55 +0200)]
netfilter: flowtable: move dst_check to packet path
Fixes sporadic IPv6 packet loss when flow offloading is enabled.
IPv6 route GC and flowtable GC are not synchronized.
When dst_cache becomes stale and a packet passes through the flow before
the flowtable GC teardowns it, the packet can be dropped.
So, it is necessary to check dst every time in packet path.
Fixes: 227e1e4d0d6c ("netfilter: nf_flowtable: skip device lookup from interface index") Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
1. ct gc may race to undo the timeout adjustment of the packet path, leaving
the conntrack entry in place with the internal offload timeout (one day).
2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE
timeout is reached before the flow offload del.
3. tcp ct is always set to ESTABLISHED with a very long timeout
in flow offload teardown/delete even though the state might be already
CLOSED. Also as a remark we cannot assume that the FIN or RST packet
is hitting flow table teardown as the packet might get bumped to the
slow path in nftables.
This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so
conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN
state transition.
Moreover, teturn the connection's ownership to conntrack upon teardown
by clearing the offload flag and fixing the established timeout value.
The flow table GC thread will asynchonrnously free the flow table and
hardware offload entries.
Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on
which is also misleading since the flow is back to classic conntrack
path.
If nf_ct_delete() removes the entry from the conntrack table, then it
calls nf_ct_put() which decrements the refcnt. This is not a problem
because the flowtable holds a reference to the conntrack object from
flow_offload_alloc() path which is released via flow_offload_free().
This patch also updates nft_flow_offload to skip packets in SYN_RECV
state. Since we might miss or bump packets to slow path, we do not know
what will happen there while we are still in SYN_RECV, this patch
postpones offload up to the next packet which also aligns to the
existing behaviour in tc-ct.
flow_offload_teardown() does not reset the existing tcp state from
flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow
path might have already update the state to CLOSE/FIN.
Joint work with Oz and Sven.
Fixes: 1e5b2471bcc4 ("netfilter: nf_flow_table: teardown flow timeout race") Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
drivers/net/ethernet/smsc/smc911x.c: In function ‘smc911x_hardware_send_pkt’:
include/linux/minmax.h:20:28: error: comparison of distinct pointer types lacks a cast [-Werror]
20 | (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
| ^~
drivers/net/ethernet/smsc/smc911x.c:483:17: note: in expansion of macro ‘min’
483 | PRINT_PKT(buf, min(len, 64));
Fix this by making the constant unsigned, to match the type of "len".
While at it, replace the other missed ternary operator by min(), too.
Convert the dummy PRINT_PKT() from a macro to a static inline function,
to catch mistakes like this without having to enable debug options
manually.
Fixes: 5ff0348b7f755aac ("net: smc911x: replace ternary operator with min()") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: David S. Miller <davem@davemloft.net>
Yang Yingliang [Wed, 18 May 2022 02:08:12 +0000 (10:08 +0800)]
net: ethernet: sunplus: add missing of_node_put() in spl2sw_mdio_init()
of_get_child_by_name() returns device node pointer with refcount
incremented. The refcount should be decremented before returning
from spl2sw_mdio_init().
Fixes: fd3040b9394c ("net: ethernet: Add driver for Sunplus SP7021") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Wells Lu <wellslutw@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Danielle Ratson [Wed, 18 May 2022 07:27:26 +0000 (10:27 +0300)]
selftests: netdevsim: Increase sleep time in hw_stats_l3.sh test
hw_stats_l3.sh test is failing often for l3 stats shows less than 20
packets after 2 seconds sleep.
This is happening since there is a race between the 2 seconds sleep and
the netdevsim actually delivering the packets.
Increase the sleep time so the packets will be delivered successfully on
time.
Signed-off-by: Danielle Ratson <danieller@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Liška [Wed, 18 May 2022 07:18:53 +0000 (09:18 +0200)]
eth: sun: cassini: remove dead code
Fixes the following GCC warning:
drivers/net/ethernet/sun/cassini.c:1316:29: error: comparison between two arrays [-Werror=array-compare]
drivers/net/ethernet/sun/cassini.c:3783:34: error: comparison between two arrays [-Werror=array-compare]
Note that 2 arrays should be compared by comparing of their addresses:
note: use ‘&cas_prog_workaroundtab[0] == &cas_prog_null[0]’ to compare the addresses
Signed-off-by: Martin Liska <mliska@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
Joel Stanley [Tue, 17 May 2022 09:22:17 +0000 (18:52 +0930)]
net: ftgmac100: Disable hardware checksum on AST2600
The AST2600 when using the i210 NIC over NC-SI has been observed to
produce incorrect checksum results with specific MTU values. This was
first observed when sending data across a long distance set of networks.
On a local network, the following test was performed using a 1MB file of
random data.
On the receiver run this script:
#!/bin/bash
while [ 1 ]; do
# Zero the stats
nstat -r > /dev/null
nc -l 9899 > test-file
# Check for checksum errors
TcpInCsumErrors=$(nstat | grep TcpInCsumErrors)
if [ -z "$TcpInCsumErrors" ]; then
echo No TcpInCsumErrors
else
echo TcpInCsumErrors = $TcpInCsumErrors
fi
done
On an AST2600 system:
# nc <IP of receiver host> 9899 < test-file
The test was repeated with various MTU values:
# ip link set mtu 1410 dev eth0
The observed results:
1500 - good
1434 - bad
1400 - good
1410 - bad
1420 - good
The test was repeated after disabling tx checksumming:
# ethtool -K eth0 tx-checksumming off
And all MTU values tested resulted in transfers without error.
An issue with the driver cannot be ruled out, however there has been no
bug discovered so far.
David has done the work to take the original bug report of slow data
transfer between long distance connections and triaged it down to this
test case.
The vendor suspects this this is a hardware issue when using NC-SI. The
fixes line refers to the patch that introduced AST2600 support.
Reported-by: David Wilder <wilder@us.ibm.com> Reviewed-by: Dylan Hung <dylan_hung@aspeedtech.com> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin Mitchell [Tue, 17 May 2022 18:01:05 +0000 (11:01 -0700)]
igb: skip phy status check where unavailable
igb_read_phy_reg() will silently return, leaving phy_data untouched, if
hw->ops.read_reg isn't set. Depending on the uninitialized value of
phy_data, this led to the phy status check either succeeding immediately
or looping continuously for 2 seconds before emitting a noisy err-level
timeout. This message went out to the console even though there was no
actual problem.
Instead, first check if there is read_reg function pointer. If not,
proceed without trying to check the phy status register.
Fixes: b72f3f72005d ("igb: When GbE link up, wait for Remote receiver status condition") Signed-off-by: Kevin Mitchell <kevmitch@arista.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The last caller of the stmmac_desc_ops::get_addr() callback was removed
a while ago, so remove the unused callback.
Note that the callback also only gets half the descriptor address on
systems with 64-bit descriptor addresses, so that should be fixed if it
needs to be resurrected later.
Fixes: ec222003bd948de8f3 ("net: stmmac: Prepare to add Split Header support") Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lin Ma [Wed, 18 May 2022 10:53:21 +0000 (18:53 +0800)]
nfc: pn533: Fix buggy cleanup order
When removing the pn533 device (i2c or USB), there is a logic error. The
original code first cancels the worker (flush_delayed_work) and then
destroys the workqueue (destroy_workqueue), leaving the timer the last
one to be deleted (del_timer). This result in a possible race condition
in a multi-core preempt-able kernel. That is, if the cleanup
(pn53x_common_clean) is concurrently run with the timer handler
(pn533_listen_mode_timer), the timer can queue the poll_work to the
already destroyed workqueue, causing use-after-free.
This patch reorder the cleanup: it uses the del_timer_sync to make sure
the handler is finished before the routine will destroy the workqueue.
Note that the timer cannot be activated by the worker again.
if (cur_mod->len == 0 && dev->poll_mod_count > 1)
mod_timer(&dev->listen_timer, ...);
That is, the mod_timer can be called only when pn533_send_poll_frame()
returns no error, which is impossible because the device is detaching
and the lower driver should return ENODEV code.
Signed-off-by: Lin Ma <linma@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 May 2022 12:05:43 +0000 (13:05 +0100)]
Merge branch 'mptcp-checksums'
Mat Martineau says:
====================
mptcp: Fix checksum byte order on little-endian
These patches address a bug in the byte ordering of MPTCP checksums on
little-endian architectures. The __sum16 type is always big endian, but
was being cast to u16 and then byte-swapped (on little-endian archs)
when reading/writing the checksum field in MPTCP option headers.
MPTCP checksums are off by default, but are enabled if one or both peers
request it in the SYN/SYNACK handshake.
The corrected code is verified to interoperate between big-endian and
little-endian machines.
Patch 1 fixes the checksum byte order, patch 2 partially mitigates
interoperation with peers sending bad checksums by falling back to TCP
instead of resetting the connection.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mat Martineau [Tue, 17 May 2022 18:02:12 +0000 (11:02 -0700)]
mptcp: Do TCP fallback on early DSS checksum failure
RFC 8684 section 3.7 describes several opportunities for a MPTCP
connection to "fall back" to regular TCP early in the connection
process, before it has been confirmed that MPTCP options can be
successfully propagated on all SYN, SYN/ACK, and data packets. If a peer
acknowledges the first received data packet with a regular TCP header
(no MPTCP options), fallback is allowed.
If the recipient of that first data packet finds a MPTCP DSS checksum
error, this provides an opportunity to fail gracefully with a TCP
fallback rather than resetting the connection (as might happen if a
checksum failure were detected later).
This commit modifies the checksum failure code to attempt fallback on
the initial subflow of a MPTCP connection, only if it's a failure in the
first data mapping. In cases where the peer initiates the connection,
requests checksums, is the first to send data, and the peer is sending
incorrect checksums (see
https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows
the connection to proceed as TCP rather than reset.
Fixes: dd8bcd1768ff ("mptcp: validate the data checksum") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 17 May 2022 18:02:11 +0000 (11:02 -0700)]
mptcp: fix checksum byte order
The MPTCP code typecasts the checksum value to u16 and
then converts it to big endian while storing the value into
the MPTCP option.
As a result, the wire encoding for little endian host is
wrong, and that causes interoperabilty interoperability
issues with other implementation or host with different endianness.
Address the issue writing in the packet the unmodified __sum16 value.
MPTCP checksum is disabled by default, interoperating with systems
with bad mptcp-level csum encoding should cause fallback to TCP.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275 Fixes: c5b39e26d003 ("mptcp: send out checksum for DSS") Fixes: 390b95a5fb84 ("mptcp: receive checksum for DSS") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
armada-3720-turris-mox and orion-mdio
This is a follow up to the change that converted the orion-mdio dt-binding from
txt to DT schema format. At the time I thought the binding needed
'unevaluatedProperties: false' because the core mdio.yaml binding didn't handle
the DSA switches. In reality it was simply the invalid reg property causing the
downstream nodes to be unevaluated. Fixing the reg nodes means we can set
'unevaluatedProperties: true'
Marek,
I don't know if you had a change for the reg properties in flight. I didn't see
anything on lore/lkml so sorry if this crosses with something you've done.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Packham [Mon, 16 May 2022 22:48:01 +0000 (10:48 +1200)]
dt-bindings: net: marvell,orion-mdio: Set unevaluatedProperties to false
When the binding was converted it appeared necessary to set
'unevaluatedProperties: true' because of the switch devices on the
turris-mox board. Actually the error was because of the reg property
being incorrect causing the rest of the properties to be unevaluated.
After the reg properties are fixed for turris-mox we can set
'unevaluatedProperties: false' as is generally expected.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Acked-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Chris Packham [Mon, 16 May 2022 22:48:00 +0000 (10:48 +1200)]
arm64: dts: armada-3720-turris-mox: Correct reg property for mdio devices
MDIO devices have #address-cells = <1>, #size-cells = <0>. Now that we
have a schema enforcing this for marvell,orion-mdio we can see that the
turris-mox has a unnecessary 2nd cell for the switch nodes reg property
of it's switch devices. Remove the unnecessary 2nd cell from the
switches reg property.
Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz> Reviewed-by: Marek Behún <kabel@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 May 2022 11:51:00 +0000 (12:51 +0100)]
Merge branch 'dsa-microchip-ksz_switch-refactor'
Arun Ramadoss says:
====================
net: dsa: microchip: refactor the ksz switch init function
During the ksz_switch_register function, it calls the individual switches init
functions (ksz8795.c and ksz9477.c). Both these functions have few things in
common like, copying the chip specific data to struct ksz_dev, allocating
ksz_port memory and mib_names memory & cnt. And to add the new LAN937x series
switch, these allocations has to be replicated.
Based on the review feedback of LAN937x part support patch, refactored the
switch init function to move allocations to switch register.
Arun Ramadoss [Tue, 17 May 2022 09:43:33 +0000 (15:13 +0530)]
net: dsa: microchip: remove unused members in ksz_device
The name, regs_size and overrides members in struct ksz_device are
unused. Hence remove it.
And host_mask is used in only place of ksz8795.c file, which can be
replaced by dev->info->cpu_ports
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:32 +0000 (15:13 +0530)]
net: dsa: microchip: add the phylink get_caps
This patch add the support for phylink_get_caps for ksz8795 and ksz9477
series switch. It updates the struct ksz_switch_chip with the details of
the internal phys and xmii interface. Then during the get_caps based on
the bits set in the structure, corresponding phy mode is set.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: move mib->cnt_ptr reset code to ksz_common.c
mib->cnt_ptr resetting is handled in multiple places as part of
port_init_cnt(). Hence moved mib->cnt_ptr code to ksz common layer
and removed from individual product files.
Signed-off-by: Prasanna Vengateshan <prasanna.vengateshan@microchip.com> Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:30 +0000 (15:13 +0530)]
net: dsa: microchip: move get_strings to ksz_common
ksz8795 and ksz9477 uses the same algorithm for copying the ethtool
strings. Hence moved to ksz_common to remove the redundant code.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:29 +0000 (15:13 +0530)]
net: dsa: microchip: move port memory allocation to ksz_common
ksz8795 and ksz9477 init function initializes the memory to dev->ports,
mib counters and assigns the ds real number of ports. Since both the
routines are same, moved the allocation of port memory to
ksz_switch_register after init.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:28 +0000 (15:13 +0530)]
net: dsa: microchip: move struct mib_names to ksz_chip_data
The ksz88xx family has one set of mib_names. The ksz87xx, ksz9477,
LAN937x based switches has one set of mib_names. In order to remove
redundant declaration, moved the struct mib_names to ksz_chip_data
structure.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:27 +0000 (15:13 +0530)]
net: dsa: microchip: perform the compatibility check for dev probed
This patch perform the compatibility check for the device after the chip
detect is done. It is to prevent the mismatch between the device
compatible specified in the device tree and actual device found during
the detect. The ksz9477 device doesn't use any .data in the
of_device_id. But the ksz8795_spi uses .data for assigning the regmap
between 8830 family and 87xx family switch. Changed the regmap
assignment based on the chip_id from the .data.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:26 +0000 (15:13 +0530)]
net: dsa: microchip: move ksz_chip_data to ksz_common
This patch moves the ksz_chip_data in ksz8795 and ksz9477 to ksz_common.
At present, the dev->chip_id is iterated with the ksz_chip_data and then
copy its value to the ksz_dev structure. These values are declared as
constant.
Instead of copying the values and referencing it, this patch update the
dev->info to the ksz_chip_data based on the chip_id in the init
function. And also update the ksz_chip_data values for the LAN937x based
switches.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Arun Ramadoss [Tue, 17 May 2022 09:43:25 +0000 (15:13 +0530)]
net: dsa: microchip: ksz8795: update the port_cnt value in ksz_chip_data
The port_cnt value in the structure is not used in the switch_init.
Instead it uses the fls(chip->cpu_port), this is due to one of port in
the ksz8794 unavailable. The cpu_port for the 8794 is 0x10, fls(0x10) =
5, hence updating it directly in the ksz_chip_data structure in order to
same with all the other switches in ksz8795.c and ksz9477.c files.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 May 2022 10:35:27 +0000 (11:35 +0100)]
Merge tag 'mlx5-updates-2022-05-17' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2022-05-17
MISC updates to mlx5 dirver
1) Aya Levin allows relaxed ordering over VFs
2) Gal Pressman Adds support XDP SQs for uplink representors in switchdev mode
3) Add debugfs TC stats and command failure syndrome for debuggability
4) Tariq uses variants of vzalloc where it could help
5) Multiport eswitch support from Elic Cohen:
Eli Cohen Says:
===============
The multiport eswitch feature allows to forward traffic from a
representor net device to the uplink port of an associated eswitch's
uplink port.
This feature requires creating a LAG object. Since LAG can be created
only once for a function, the feature is mutual exclusive with either
bonding or multipath.
Multipath eswitch mode is entered automatically these conditions are
met:
1. No other LAG related mode is active.
2. A rule that explicitly forwards to an uplink port is inserted.
The implementation maintains a reference count on such rules. When the
reference count reaches zero, the LAG is released and other modes may be
used.
When an explicit rule that explicitly forwards to an uplink port is
inserted while another LAG mode is active, that rule will not be
offloaded by the hardware since the hardware cannot guarantee that the
rule will actually be forwarded to that port.
Example rules that forwards to an uplink port is:
$ tc filter add dev rep0 root flower action mirred egress \
redirect dev uplinkrep0
$ tc filter add dev rep0 root flower action mirred egress \
redirect dev uplinkrep1
This feature is supported only if LAG_RESOURCE_ALLOCATION firmware
configuration parameter is set to true.
The series consists of three patches:
1. Lag state machine refactor
This patch does not add new functionality but rather changes the way
the state of the LAG is maintained.
2. Small fix to remove unused argument.
3. The actual implementation of the feature.
===============
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Since the recent introduction supporting the SM3 and SM4 hash algos for IPsec, the kernel
produces invalid pfkey acquire messages, when these encryption modules are disabled. This
happens because the availability of the algos wasn't checked in all necessary functions.
This patch adds these checks.
Signed-off-by: Thomas Bartschies <thomas.bartschies@cvk.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Jiasheng Jiang [Tue, 17 May 2022 09:42:31 +0000 (17:42 +0800)]
net: af_key: add check for pfkey_broadcast in function pfkey_process
If skb_clone() returns null pointer, pfkey_broadcast() will
return error.
Therefore, it should be better to check the return value of
pfkey_broadcast() and return error if fails.
Eli Cohen [Mon, 31 Jan 2022 05:49:51 +0000 (07:49 +0200)]
net/mlx5: Support multiport eswitch mode
Multiport eswitch mode is a LAG mode that allows to add rules that
forward traffic to a specific physical port without being affected by LAG
affinity configuration.
This mode of operation is mutual exclusive with the other LAG modes used
by multipath and bonding.
To make the transition between the modes, we maintain a counter on the
number of rules specifying one of the uplink representors as the target
of mirred egress redirect action.
An example of such rule would be:
$ tc filter add dev enp8s0f0_0 prot all root flower dst_mac \
00:11:22:33:44:55 action mirred egress redirect dev enp8s0f0
If the reference count just grows to one and LAG is not in use, we
create the LAG in multiport eswitch mode. Other mode changes are not
allowed while in this mode. When the reference count reaches zero, we
destroy the LAG and let other modes be used if needed.
logic also changed such that if forwarding to some uplink destination
cannot be guaranteed, we fail the operation so the rule will eventually
be in software and not in hardware.
Signed-off-by: Eli Cohen <elic@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Eli Cohen [Mon, 24 Jan 2022 09:30:46 +0000 (11:30 +0200)]
net/mlx5: Lag, refactor lag state machine
LAG state machine is implemented using bit flags. However, all these bit
flags, except for MLX5_LAG_FLAG_HASH_BASED, are really mutual exclusive.
In addition, MLX5_LAG_FLAG_READY is used by bonding to mark if we have
our netdevices successfully added to lag and does not really belong in
the same flags variable as the other flags.
Rename MLX5_LAG_FLAG_READY to MLX5_LAG_FLAG_NDEVS_READY to better
reflect its purpose and put it in a new flags variable.
For the rest of the flags, we introduce a mode enum to hold the state
of the LAG.
Remove the shared fdb boolean flag from struct mlx5_lag and store this
configuration as a mode flag.
Change all flag related operations to use standard Linux APIs.
Signed-off-by: Eli Cohen <elic@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Moshe Tal [Wed, 27 Apr 2022 15:26:52 +0000 (18:26 +0300)]
net/mlx5e: Correct the calculation of max channels for rep
Correct the calculation of maximum channels of rep to better utilize
the hardware resources and allow a larger scale of reps.
This will allow creation of all virtual ports configured.
Fixes: 473baf2e9e8c ("net/mlx5e: Allow profile-specific limitation on max num of channels") Signed-off-by: Moshe Tal <moshet@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Saeed Mahameed [Wed, 18 May 2022 03:13:09 +0000 (20:13 -0700)]
net/mlx5e: CT: Add ct driver counters
Connection offload is translated to multiple rules over several
hardware flow tables. Unhandled end-cases may cause a hardware
resource leak causing multiple system symptoms such as a host
memory leak, decreased performance and other scale related issues.
Export the current number of firmware FTEs related to the CT table
as a debugfs counter. Also add a dropped packets counter to help
debug packets dropped on restore failure.
To show the offloaded count:
cat /sys/kernel/debug/mlx5/<PCI>/ct_nic/offloaded
To show the dropped count:
cat /sys/kernel/debug/mlx5/<PCI>/ct_nic/rx_dropped
Signed-off-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Roi Dayan <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Reviewed-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Paul Blakey <paulb@nvidia.com>
Aya Levin [Mon, 11 Apr 2022 06:27:39 +0000 (09:27 +0300)]
net/mlx5e: Allow relaxed ordering over VFs
By PCI spec, the config space of the VF always report relaxed ordering
not supported while it inherits this property from its PF. Hence using
pcie_relaxed_ordering_enable(), always disables the relaxed ordering on
all VFs. Remove this check and rely on the firmware which queries the
config space of the PF and set the capability bit accordingly.
Signed-off-by: Aya Levin <ayal@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Marina Varshaver <marinav@nvidia.com> Reviewed-by: Gal Shalom <galshalom@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Gal Pressman [Thu, 31 Mar 2022 06:26:00 +0000 (09:26 +0300)]
net/mlx5e: Support partial GSO for tunnels over vlans
Offloading outer checksum on tunnels requires GSO partial, add it to
'vlan_features' to allow offloading tunnels over vlans.
For example, running GENEVE over vlan & ipv6 (mandatory UDP checksum)
now allows for hardware TSO instead of software segmentation in GSO
only.
Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Aya Levin <ayal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Gal Pressman [Mon, 11 Apr 2022 12:44:01 +0000 (15:44 +0300)]
net/mlx5e: IPoIB, Improve ethtool rxnfc callback structure in IPoIB
Followup commit 79ce39be1d63 ("net/mlx5e: Improve ethtool rxnfc callback structure")
and handle CONFIG_MLX5_EN_RXNFC enabled/disabled inside the fs layer so
the ethtool callbacks are always available. The fs layer will provide
stubs when CONFIG_MLX5_EN_RXNFC is compiled out.
Moshe Shemesh [Fri, 13 May 2022 03:19:31 +0000 (06:19 +0300)]
net/mlx5: Add last command failure syndrome to debugfs
Add syndrome of last command failure per command type to debugfs to ease
debugging of such failure.
last_failed_syndrome - last command failed syndrome returned by FW.
Al Viro [Wed, 18 May 2022 06:13:40 +0000 (02:13 -0400)]
percpu_ref_init(): clean ->percpu_count_ref on failure
That way percpu_ref_exit() is safe after failing percpu_ref_init().
At least one user (cgroup_create()) had a double-free that way;
there might be other similar bugs. Easier to fix in percpu_ref_init(),
rather than playing whack-a-mole in sloppy users...
Usual symptoms look like a messed refcounting in one of subsystems
that use percpu allocations (might be percpu-refcount, might be
something else). Having refcounts for two different objects share
memory is Not Nice(tm)...
Reported-by: syzbot+5b1e53987f858500ec00@syzkaller.appspotmail.com Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
In case fw sync reset is called in parallel to device removal, device
might stuck in the following deadlock:
CPU 0 CPU 1
----- -----
remove_one
uninit_one (locks intf_state_mutex)
mlx5_sync_reset_now_event()
work in fw_reset->wq.
mlx5_enter_error_state()
mutex_lock (intf_state_mutex)
cleanup_once
fw_reset_cleanup()
destroy_workqueue(fw_reset->wq)
Drain the fw_reset WQ, and make sure no new work is being queued, before
entering uninit_one().
The Drain is done before devlink_unregister() since fw_reset, in some
flows, is using devlink API devlink_remote_reload_actions_performed().
Paul Blakey [Thu, 7 Apr 2022 10:37:32 +0000 (13:37 +0300)]
net/mlx5e: CT: Fix setting flow_source for smfs ct tuples
Cited patch sets flow_source to ANY overriding the provided spec
flow_source, avoiding the optimization done by commit c9c079b4deaa
("net/mlx5: CT: Set flow source hint from provided tuple device").
To fix the above, set the dr_rule flow_source from provided flow spec.
Gal Pressman [Wed, 13 Apr 2022 12:50:42 +0000 (15:50 +0300)]
net/mlx5e: Remove HW-GRO from reported features
We got reports of certain HW-GRO flows causing kernel call traces, which
might be related to firmware. To be on the safe side, disable the
feature for now and re-enable it once a driver/firmware fix is found.
net/mlx5e: Properly block HW GRO when XDP is enabled
HW GRO is incompatible and mutually exclusive with XDP and XSK. However,
the needed checks are only made when enabling XDP. If HW GRO is enabled
when XDP is already active, the command will succeed, and XDP will be
skipped in the data path, although still enabled.
This commit fixes the bug by checking the XDP and XSK status in
mlx5e_fix_features and disabling HW GRO if XDP is enabled.
LRO is incompatible and mutually exclusive with XDP. However, the needed
checks are only made when enabling XDP. If LRO is enabled when XDP is
already active, the command will succeed, and XDP will be skipped in the
data path, although still enabled.
This commit fixes the bug by checking the XDP status in
mlx5e_fix_features and disabling LRO if XDP is enabled.
Aya Levin [Mon, 11 Apr 2022 14:29:08 +0000 (17:29 +0300)]
net/mlx5e: Block rx-gro-hw feature in switchdev mode
When the driver is in switchdev mode and rx-gro-hw is set, the RQ needs
special CQE handling. Till then, block setting of rx-gro-hw feature in
switchdev mode, to avoid failure while setting the feature due to
failure while opening the RQ.
net/mlx5e: Wrap mlx5e_trap_napi_poll into rcu_read_lock
The body of mlx5e_napi_poll is wrapped into rcu_read_lock to be able to
read the XDP program pointer using rcu_dereference. However, the trap RQ
NAPI doesn't use rcu_read_lock, because the trap RQ works only in the
non-linear mode, and mlx5e_skb_from_cqe_nonlinear, until recently,
didn't support XDP and didn't call rcu_dereference.
Starting from the cited commit, mlx5e_skb_from_cqe_nonlinear supports
XDP and calls rcu_dereference, but mlx5e_trap_napi_poll doesn't wrap it
into rcu_read_lock. It leads to RCU-lockdep warnings like this:
WARNING: suspicious RCU usage
This commit fixes the issue by adding an rcu_read_lock to
mlx5e_trap_napi_poll, similarly to mlx5e_napi_poll.
Fixes: ea5d49bdae8b ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
net/mlx5: DR, Ignore modify TTL on RX if device doesn't support it
When modifying TTL, packet's csum has to be recalculated.
Due to HW issue in ConnectX-5, csum recalculation for modify
TTL on RX is supported through a work-around that is specifically
enabled by configuration.
If the work-around isn't enabled, rather than adding an unsupported
action the modify TTL action on RX should be ignored.
Ignoring modify TTL action might result in zero actions, so in such
cases we will not convert the match STE to modify STE, as it is done
by FW in DMFS.
This patch fixes an issue where modify TTL action was ignored both
on RX and TX instead of only on RX.
Fixes: 4ff725e1d4ad ("net/mlx5: DR, Ignore modify TTL if device doesn't support it") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Shay Drory [Wed, 9 Mar 2022 12:45:58 +0000 (14:45 +0200)]
net/mlx5: Initialize flow steering during driver probe
Currently, software objects of flow steering are created and destroyed
during reload flow. In case a device is unloaded, the following error
is printed during grace period:
mlx5_core 0000:00:0b.0: mlx5_fw_fatal_reporter_err_work:690:(pid 95):
Driver is in error state. Unloading
As a solution to fix use-after-free bugs, where we try to access
these objects, when reading the value of flow_steering_mode devlink
param[1], let's split flow steering creation and destruction into two
routines:
* init and cleanup: memory, cache, and pools allocation/free.
* create and destroy: namespaces initialization and cleanup.
While at it, re-order the cleanup function to mirror the init function.
Maor Dickman [Mon, 21 Mar 2022 08:07:44 +0000 (10:07 +0200)]
net/mlx5: DR, Fix missing flow_source when creating multi-destination FW table
In order to support multiple destination FTEs with SW steering
FW table is created with single FTE with multiple actions and
SW steering rule forward to it. When creating this table, flow
source isn't set according to the original FTE.
Fix this by passing the original FTE flow source to the created
FW table.
Duoming Zhou [Tue, 17 May 2022 01:25:30 +0000 (09:25 +0800)]
NFC: nci: fix sleep in atomic context bugs caused by nci_skb_alloc
There are sleep in atomic context bugs when the request to secure
element of st-nci is timeout. The root cause is that nci_skb_alloc
with GFP_KERNEL parameter is called in st_nci_se_wt_timeout which is
a timer handler. The call paths that could trigger bugs are shown below:
This patch changes allocation mode of nci_skb_alloc from GFP_KERNEL to
GFP_ATOMIC in order to prevent atomic context sleeping. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: ed06aeefdac3 ("nfc: st-nci: Rename st21nfcb to st-nci") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220517012530.75714-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Min Li [Mon, 16 May 2022 14:47:06 +0000 (10:47 -0400)]
ptp: ptp_clockmatrix: Add PTP_CLK_REQ_EXTTS support
Use TOD_READ_SECONDARY for extts to keep TOD_READ_PRIMARY
for gettime and settime exclusively. Before this change,
TOD_READ_PRIMARY was used for both extts and gettime/settime,
which would result in changing TOD read/write triggers between
operations. Using TOD_READ_SECONDARY would make extts
independent of gettime/settime operation
====================
net/smc: send and write inline optimization for smc
Send cdc msgs and write data inline if qp has sufficent inline
space, helps latency reducing.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-1.3us improvement in latency.
The results shown below:
msgsize before after
1B 11.9 us 10.6 us (-1.3 us)
2B 11.7 us 10.7 us (-1.0 us)
4B 11.7 us 10.7 us (-1.0 us)
8B 11.6 us 10.6 us (-1.0 us)
16B 11.7 us 10.7 us (-1.0 us)
32B 11.7 us 10.6 us (-1.1 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.3 us (-0.5 us)
1KB 11.9 us 11.5 us (-0.4 us)
2KB 12.1 us 11.5 us (-0.6 us)
====================
Guangguan Wang [Mon, 16 May 2022 05:51:37 +0000 (13:51 +0800)]
net/smc: rdma write inline if qp has sufficient inline space
Rdma write with inline flag when sending small packages,
whose length is shorter than the qp's max_inline_data, can
help reducing latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.5us-0.7us improvement in latency.
The results shown below:
msgsize before after
1B 11.2 us 10.6 us (-0.6 us)
2B 11.2 us 10.7 us (-0.5 us)
4B 11.3 us 10.7 us (-0.6 us)
8B 11.2 us 10.6 us (-0.6 us)
16B 11.3 us 10.7 us (-0.6 us)
32B 11.3 us 10.6 us (-0.7 us)
64B 11.2 us 11.2 us (0 us)
128B 11.2 us 11.2 us (0 us)
256B 11.2 us 11.2 us (0 us)
512B 11.4 us 11.3 us (-0.1 us)
1KB 11.4 us 11.5 us (0.1 us)
2KB 11.5 us 11.5 us (0 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: kernel test robot <lkp@intel.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guangguan Wang [Mon, 16 May 2022 05:51:36 +0000 (13:51 +0800)]
net/smc: send cdc msg inline if qp has sufficient inline space
As cdc msg's length is 44B, cdc msgs can be sent inline in
most rdma devices, which can help reducing sending latency.
In my test environment, which are 2 VMs running on the same
physical host and whose NICs(ConnectX-4Lx) are working on
SR-IOV mode, qperf shows 0.4us-0.7us improvement in latency.
The results shown below:
msgsize before after
1B 11.9 us 11.2 us (-0.7 us)
2B 11.7 us 11.2 us (-0.5 us)
4B 11.7 us 11.3 us (-0.4 us)
8B 11.6 us 11.2 us (-0.4 us)
16B 11.7 us 11.3 us (-0.4 us)
32B 11.7 us 11.3 us (-0.4 us)
64B 11.7 us 11.2 us (-0.5 us)
128B 11.6 us 11.2 us (-0.4 us)
256B 11.8 us 11.2 us (-0.6 us)
512B 11.8 us 11.4 us (-0.4 us)
1KB 11.9 us 11.4 us (-0.5 us)
2KB 12.1 us 11.5 us (-0.6 us)
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Tested-by: kernel test robot <lkp@intel.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>