1) Reduce number of hardware offload retries from flowtable datapath
which might hog system with retries, from Felix Fietkau.
2) Skip neighbour lookup for PPPoE device, fill_forward_path() already
provides this and set on destination address from fill_forward_path for
PPPoE device, also from Felix.
4) When combining PPPoE on top of a VLAN device, set info->outdev to the
PPPoE device so software offload works, from Felix.
5) Fix TCP teardown flowtable state, races with conntrack gc might result
in resetting the state to ESTABLISHED and the time to one day. Joint
work with Oz Shlomo and Sven Auhagen.
6) Call dst_check() from flowtable datapath to check if dst is stale
instead of doing it from garbage collector path.
7) Disable register tracking infrastructure, either user-space or
kernel need to pre-fetch keys inconditionally, otherwise register
tracking assumes data is already available in register that might
not well be there, leading to incorrect reductions.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: disable expression reduction infra
netfilter: flowtable: move dst_check to packet path
netfilter: flowtable: fix TCP flow teardown
netfilter: nft_flow_offload: fix offload with pppoe + vlan
net: fix dev_fill_forward_path with pppoe + bridge
netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices
netfilter: flowtable: fix excessive hw offload attempts after failure
====================
netfilter: nf_tables: disable expression reduction infra
Either userspace or kernelspace need to pre-fetch keys inconditionally
before comparisons for this to work. Otherwise, register tracking data
is misleading and it might result in reducing expressions which are not
yet registers.
First expression is also guaranteed to be evaluated always, however,
certain expressions break before writing data to registers, before
comparing the data, leaving the register in undetermined state.
This patch disables this infrastructure by now.
Fixes: cdba27802fee ("netfilter: nf_tables: do not reduce read-only expressions") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Ritaro Takenaka [Tue, 17 May 2022 10:55:30 +0000 (12:55 +0200)]
netfilter: flowtable: move dst_check to packet path
Fixes sporadic IPv6 packet loss when flow offloading is enabled.
IPv6 route GC and flowtable GC are not synchronized.
When dst_cache becomes stale and a packet passes through the flow before
the flowtable GC teardowns it, the packet can be dropped.
So, it is necessary to check dst every time in packet path.
Fixes: a7e275dd8482 ("netfilter: nf_flowtable: skip device lookup from interface index") Signed-off-by: Ritaro Takenaka <ritarot634@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
1. ct gc may race to undo the timeout adjustment of the packet path, leaving
the conntrack entry in place with the internal offload timeout (one day).
2. ct gc removes the ct because the IPS_OFFLOAD_BIT is not set and the CLOSE
timeout is reached before the flow offload del.
3. tcp ct is always set to ESTABLISHED with a very long timeout
in flow offload teardown/delete even though the state might be already
CLOSED. Also as a remark we cannot assume that the FIN or RST packet
is hitting flow table teardown as the packet might get bumped to the
slow path in nftables.
This patch resets IPS_OFFLOAD_BIT from flow_offload_teardown(), so
conntrack handles the tcp rst/fin packet which triggers the CLOSE/FIN
state transition.
Moreover, teturn the connection's ownership to conntrack upon teardown
by clearing the offload flag and fixing the established timeout value.
The flow table GC thread will asynchonrnously free the flow table and
hardware offload entries.
Before this patch, the IPS_OFFLOAD_BIT remained set for expired flows on
which is also misleading since the flow is back to classic conntrack
path.
If nf_ct_delete() removes the entry from the conntrack table, then it
calls nf_ct_put() which decrements the refcnt. This is not a problem
because the flowtable holds a reference to the conntrack object from
flow_offload_alloc() path which is released via flow_offload_free().
This patch also updates nft_flow_offload to skip packets in SYN_RECV
state. Since we might miss or bump packets to slow path, we do not know
what will happen there while we are still in SYN_RECV, this patch
postpones offload up to the next packet which also aligns to the
existing behaviour in tc-ct.
flow_offload_teardown() does not reset the existing tcp state from
flow_offload_fixup_tcp() to ESTABLISHED anymore, packets bump to slow
path might have already update the state to CLOSE/FIN.
Joint work with Oz and Sven.
Fixes: 9152b7338a1d ("netfilter: nf_flow_table: teardown flow timeout race") Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Joel Stanley [Tue, 17 May 2022 09:22:17 +0000 (18:52 +0930)]
net: ftgmac100: Disable hardware checksum on AST2600
The AST2600 when using the i210 NIC over NC-SI has been observed to
produce incorrect checksum results with specific MTU values. This was
first observed when sending data across a long distance set of networks.
On a local network, the following test was performed using a 1MB file of
random data.
On the receiver run this script:
#!/bin/bash
while [ 1 ]; do
# Zero the stats
nstat -r > /dev/null
nc -l 9899 > test-file
# Check for checksum errors
TcpInCsumErrors=$(nstat | grep TcpInCsumErrors)
if [ -z "$TcpInCsumErrors" ]; then
echo No TcpInCsumErrors
else
echo TcpInCsumErrors = $TcpInCsumErrors
fi
done
On an AST2600 system:
# nc <IP of receiver host> 9899 < test-file
The test was repeated with various MTU values:
# ip link set mtu 1410 dev eth0
The observed results:
1500 - good
1434 - bad
1400 - good
1410 - bad
1420 - good
The test was repeated after disabling tx checksumming:
# ethtool -K eth0 tx-checksumming off
And all MTU values tested resulted in transfers without error.
An issue with the driver cannot be ruled out, however there has been no
bug discovered so far.
David has done the work to take the original bug report of slow data
transfer between long distance connections and triaged it down to this
test case.
The vendor suspects this this is a hardware issue when using NC-SI. The
fixes line refers to the patch that introduced AST2600 support.
Reported-by: David Wilder <wilder@us.ibm.com> Reviewed-by: Dylan Hung <dylan_hung@aspeedtech.com> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Kevin Mitchell [Tue, 17 May 2022 18:01:05 +0000 (11:01 -0700)]
igb: skip phy status check where unavailable
igb_read_phy_reg() will silently return, leaving phy_data untouched, if
hw->ops.read_reg isn't set. Depending on the uninitialized value of
phy_data, this led to the phy status check either succeeding immediately
or looping continuously for 2 seconds before emitting a noisy err-level
timeout. This message went out to the console even though there was no
actual problem.
Instead, first check if there is read_reg function pointer. If not,
proceed without trying to check the phy status register.
Fixes: 2929c377ead7 ("igb: When GbE link up, wait for Remote receiver status condition") Signed-off-by: Kevin Mitchell <kevmitch@arista.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Lin Ma [Wed, 18 May 2022 10:53:21 +0000 (18:53 +0800)]
nfc: pn533: Fix buggy cleanup order
When removing the pn533 device (i2c or USB), there is a logic error. The
original code first cancels the worker (flush_delayed_work) and then
destroys the workqueue (destroy_workqueue), leaving the timer the last
one to be deleted (del_timer). This result in a possible race condition
in a multi-core preempt-able kernel. That is, if the cleanup
(pn53x_common_clean) is concurrently run with the timer handler
(pn533_listen_mode_timer), the timer can queue the poll_work to the
already destroyed workqueue, causing use-after-free.
This patch reorder the cleanup: it uses the del_timer_sync to make sure
the handler is finished before the routine will destroy the workqueue.
Note that the timer cannot be activated by the worker again.
if (cur_mod->len == 0 && dev->poll_mod_count > 1)
mod_timer(&dev->listen_timer, ...);
That is, the mod_timer can be called only when pn533_send_poll_frame()
returns no error, which is impossible because the device is detaching
and the lower driver should return ENODEV code.
Signed-off-by: Lin Ma <linma@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 18 May 2022 12:05:43 +0000 (13:05 +0100)]
Merge branch 'mptcp-checksums'
Mat Martineau says:
====================
mptcp: Fix checksum byte order on little-endian
These patches address a bug in the byte ordering of MPTCP checksums on
little-endian architectures. The __sum16 type is always big endian, but
was being cast to u16 and then byte-swapped (on little-endian archs)
when reading/writing the checksum field in MPTCP option headers.
MPTCP checksums are off by default, but are enabled if one or both peers
request it in the SYN/SYNACK handshake.
The corrected code is verified to interoperate between big-endian and
little-endian machines.
Patch 1 fixes the checksum byte order, patch 2 partially mitigates
interoperation with peers sending bad checksums by falling back to TCP
instead of resetting the connection.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Mat Martineau [Tue, 17 May 2022 18:02:12 +0000 (11:02 -0700)]
mptcp: Do TCP fallback on early DSS checksum failure
RFC 8684 section 3.7 describes several opportunities for a MPTCP
connection to "fall back" to regular TCP early in the connection
process, before it has been confirmed that MPTCP options can be
successfully propagated on all SYN, SYN/ACK, and data packets. If a peer
acknowledges the first received data packet with a regular TCP header
(no MPTCP options), fallback is allowed.
If the recipient of that first data packet finds a MPTCP DSS checksum
error, this provides an opportunity to fail gracefully with a TCP
fallback rather than resetting the connection (as might happen if a
checksum failure were detected later).
This commit modifies the checksum failure code to attempt fallback on
the initial subflow of a MPTCP connection, only if it's a failure in the
first data mapping. In cases where the peer initiates the connection,
requests checksums, is the first to send data, and the peer is sending
incorrect checksums (see
https://github.com/multipath-tcp/mptcp_net-next/issues/275), this allows
the connection to proceed as TCP rather than reset.
Fixes: 02fbc4842152 ("mptcp: validate the data checksum") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Tue, 17 May 2022 18:02:11 +0000 (11:02 -0700)]
mptcp: fix checksum byte order
The MPTCP code typecasts the checksum value to u16 and
then converts it to big endian while storing the value into
the MPTCP option.
As a result, the wire encoding for little endian host is
wrong, and that causes interoperabilty interoperability
issues with other implementation or host with different endianness.
Address the issue writing in the packet the unmodified __sum16 value.
MPTCP checksum is disabled by default, interoperating with systems
with bad mptcp-level csum encoding should cause fallback to TCP.
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/275 Fixes: 939a8166ad8d ("mptcp: send out checksum for DSS") Fixes: 7fe09189ef30 ("mptcp: receive checksum for DSS") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Since the recent introduction supporting the SM3 and SM4 hash algos for IPsec, the kernel
produces invalid pfkey acquire messages, when these encryption modules are disabled. This
happens because the availability of the algos wasn't checked in all necessary functions.
This patch adds these checks.
Signed-off-by: Thomas Bartschies <thomas.bartschies@cvk.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Jiasheng Jiang [Tue, 17 May 2022 09:42:31 +0000 (17:42 +0800)]
net: af_key: add check for pfkey_broadcast in function pfkey_process
If skb_clone() returns null pointer, pfkey_broadcast() will
return error.
Therefore, it should be better to check the return value of
pfkey_broadcast() and return error if fails.
In case fw sync reset is called in parallel to device removal, device
might stuck in the following deadlock:
CPU 0 CPU 1
----- -----
remove_one
uninit_one (locks intf_state_mutex)
mlx5_sync_reset_now_event()
work in fw_reset->wq.
mlx5_enter_error_state()
mutex_lock (intf_state_mutex)
cleanup_once
fw_reset_cleanup()
destroy_workqueue(fw_reset->wq)
Drain the fw_reset WQ, and make sure no new work is being queued, before
entering uninit_one().
The Drain is done before devlink_unregister() since fw_reset, in some
flows, is using devlink API devlink_remote_reload_actions_performed().
Paul Blakey [Thu, 7 Apr 2022 10:37:32 +0000 (13:37 +0300)]
net/mlx5e: CT: Fix setting flow_source for smfs ct tuples
Cited patch sets flow_source to ANY overriding the provided spec
flow_source, avoiding the optimization done by commit 131e148b8eb6
("net/mlx5: CT: Set flow source hint from provided tuple device").
To fix the above, set the dr_rule flow_source from provided flow spec.
Gal Pressman [Wed, 13 Apr 2022 12:50:42 +0000 (15:50 +0300)]
net/mlx5e: Remove HW-GRO from reported features
We got reports of certain HW-GRO flows causing kernel call traces, which
might be related to firmware. To be on the safe side, disable the
feature for now and re-enable it once a driver/firmware fix is found.
net/mlx5e: Properly block HW GRO when XDP is enabled
HW GRO is incompatible and mutually exclusive with XDP and XSK. However,
the needed checks are only made when enabling XDP. If HW GRO is enabled
when XDP is already active, the command will succeed, and XDP will be
skipped in the data path, although still enabled.
This commit fixes the bug by checking the XDP and XSK status in
mlx5e_fix_features and disabling HW GRO if XDP is enabled.
LRO is incompatible and mutually exclusive with XDP. However, the needed
checks are only made when enabling XDP. If LRO is enabled when XDP is
already active, the command will succeed, and XDP will be skipped in the
data path, although still enabled.
This commit fixes the bug by checking the XDP status in
mlx5e_fix_features and disabling LRO if XDP is enabled.
Aya Levin [Mon, 11 Apr 2022 14:29:08 +0000 (17:29 +0300)]
net/mlx5e: Block rx-gro-hw feature in switchdev mode
When the driver is in switchdev mode and rx-gro-hw is set, the RQ needs
special CQE handling. Till then, block setting of rx-gro-hw feature in
switchdev mode, to avoid failure while setting the feature due to
failure while opening the RQ.
net/mlx5e: Wrap mlx5e_trap_napi_poll into rcu_read_lock
The body of mlx5e_napi_poll is wrapped into rcu_read_lock to be able to
read the XDP program pointer using rcu_dereference. However, the trap RQ
NAPI doesn't use rcu_read_lock, because the trap RQ works only in the
non-linear mode, and mlx5e_skb_from_cqe_nonlinear, until recently,
didn't support XDP and didn't call rcu_dereference.
Starting from the cited commit, mlx5e_skb_from_cqe_nonlinear supports
XDP and calls rcu_dereference, but mlx5e_trap_napi_poll doesn't wrap it
into rcu_read_lock. It leads to RCU-lockdep warnings like this:
WARNING: suspicious RCU usage
This commit fixes the issue by adding an rcu_read_lock to
mlx5e_trap_napi_poll, similarly to mlx5e_napi_poll.
Fixes: 3d77d6c3366b ("net/mlx5e: Add XDP multi buffer support to the non-linear legacy RQ") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
net/mlx5: DR, Ignore modify TTL on RX if device doesn't support it
When modifying TTL, packet's csum has to be recalculated.
Due to HW issue in ConnectX-5, csum recalculation for modify
TTL on RX is supported through a work-around that is specifically
enabled by configuration.
If the work-around isn't enabled, rather than adding an unsupported
action the modify TTL action on RX should be ignored.
Ignoring modify TTL action might result in zero actions, so in such
cases we will not convert the match STE to modify STE, as it is done
by FW in DMFS.
This patch fixes an issue where modify TTL action was ignored both
on RX and TX instead of only on RX.
Fixes: 2d41cee3c20b ("net/mlx5: DR, Ignore modify TTL if device doesn't support it") Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com> Reviewed-by: Alex Vesker <valex@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Shay Drory [Wed, 9 Mar 2022 12:45:58 +0000 (14:45 +0200)]
net/mlx5: Initialize flow steering during driver probe
Currently, software objects of flow steering are created and destroyed
during reload flow. In case a device is unloaded, the following error
is printed during grace period:
mlx5_core 0000:00:0b.0: mlx5_fw_fatal_reporter_err_work:690:(pid 95):
Driver is in error state. Unloading
As a solution to fix use-after-free bugs, where we try to access
these objects, when reading the value of flow_steering_mode devlink
param[1], let's split flow steering creation and destruction into two
routines:
* init and cleanup: memory, cache, and pools allocation/free.
* create and destroy: namespaces initialization and cleanup.
While at it, re-order the cleanup function to mirror the init function.
Maor Dickman [Mon, 21 Mar 2022 08:07:44 +0000 (10:07 +0200)]
net/mlx5: DR, Fix missing flow_source when creating multi-destination FW table
In order to support multiple destination FTEs with SW steering
FW table is created with single FTE with multiple actions and
SW steering rule forward to it. When creating this table, flow
source isn't set according to the original FTE.
Fix this by passing the original FTE flow source to the created
FW table.
Duoming Zhou [Tue, 17 May 2022 01:25:30 +0000 (09:25 +0800)]
NFC: nci: fix sleep in atomic context bugs caused by nci_skb_alloc
There are sleep in atomic context bugs when the request to secure
element of st-nci is timeout. The root cause is that nci_skb_alloc
with GFP_KERNEL parameter is called in st_nci_se_wt_timeout which is
a timer handler. The call paths that could trigger bugs are shown below:
This patch changes allocation mode of nci_skb_alloc from GFP_KERNEL to
GFP_ATOMIC in order to prevent atomic context sleeping. The GFP_ATOMIC
flag makes memory allocation operation could be used in atomic context.
Fixes: 2d59879f1b84 ("nfc: st-nci: Rename st21nfcb to st-nci") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20220517012530.75714-1-duoming@zju.edu.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
test_bit() tests if one bit is set or not.
Here the logic seems to check of bit QL_RESET_PER_SCSI (i.e. 4) OR bit
QL_RESET_START (i.e. 3) is set.
In fact, it checks if bit 7 (4 | 3 = 7) is set, that is to say
QL_ADAPTER_UP.
This looks harmless, because this bit is likely be set, and when the
ql_reset_work() delayed work is scheduled in ql3xxx_isr() (the only place
that schedule this work), QL_RESET_START or QL_RESET_PER_SCSI is set.
Adaptive-rx and Adaptive-tx are interrupt moderation settings
that can be enabled/disabled using ethtool:
ethtool -C ethX adaptive-rx on/off adaptive-tx on/off
Unfortunately those settings are getting cleared after
changing number of queues, or in ethtool world 'channels':
ethtool -L ethX rx 1 tx 1
Clearing was happening due to introduction of bit fields
in ice_ring_container struct. This way only itr_setting
bits were rebuilt during ice_vsi_rebuild_set_coalesce().
Introduce an anonymous struct of bitfields and create a
union to refer to them as a single variable.
This way variable can be easily saved and restored.
Fixes: ac78f7aec5e5 ("ice: Restore interrupt throttle settings after VSI rebuild") Signed-off-by: Michal Wilczynski <michal.wilczynski@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Paul Greenwalt [Thu, 28 Apr 2022 21:11:42 +0000 (14:11 -0700)]
ice: fix possible under reporting of ethtool Tx and Rx statistics
The hardware statistics counters are not cleared during resets so the
drivers first access is to initialize the baseline and then subsequent
reads are for reporting the counters. The statistics counters are read
during the watchdog subtask when the interface is up. If the baseline
is not initialized before the interface is up, then there can be a brief
window in which some traffic can be transmitted/received before the
initial baseline reading takes place.
Directly initialize ethtool statistics in driver open so the baseline will
be initialized when the interface is up, and any dropped packets
incremented before the interface is up won't be reported.
Fixes: e5885b1286785 ("ice: ignore dropped packets during init") Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Do not allow to write timestamps on RX rings if PF is being configured.
When PF is being configured RX rings can be freed or rebuilt. If at the
same time timestamps are updated, the kernel will crash by dereferencing
null RX ring pointer.
Zixuan Fu [Sat, 14 May 2022 05:07:11 +0000 (13:07 +0800)]
net: vmxnet3: fix possible NULL pointer dereference in vmxnet3_rq_cleanup()
In vmxnet3_rq_create(), when dma_alloc_coherent() fails,
vmxnet3_rq_destroy() is called. It sets rq->rx_ring[i].base to NULL. Then
vmxnet3_rq_create() returns an error to its callers mxnet3_rq_create_all()
-> vmxnet3_change_mtu(). Then vmxnet3_change_mtu() calls
vmxnet3_force_close() -> dev_close() in error handling code. And the driver
calls vmxnet3_close() -> vmxnet3_quiesce_dev() -> vmxnet3_rq_cleanup_all()
-> vmxnet3_rq_cleanup(). In vmxnet3_rq_cleanup(),
rq->rx_ring[ring_idx].base is accessed, but this variable is NULL, causing
a NULL pointer dereference.
To fix this possible bug, an if statement is added to check whether
rq->rx_ring[0].base is NULL in vmxnet3_rq_cleanup() and exit early if so.
The error log in our fault-injection testing is shown as follows:
Zixuan Fu [Sat, 14 May 2022 05:06:56 +0000 (13:06 +0800)]
net: vmxnet3: fix possible use-after-free bugs in vmxnet3_rq_alloc_rx_buf()
In vmxnet3_rq_alloc_rx_buf(), when dma_map_single() fails, rbi->skb is
freed immediately. Similarly, in another branch, when dma_map_page() fails,
rbi->page is also freed. In the two cases, vmxnet3_rq_alloc_rx_buf()
returns an error to its callers vmxnet3_rq_init() -> vmxnet3_rq_init_all()
-> vmxnet3_activate_dev(). Then vmxnet3_activate_dev() calls
vmxnet3_rq_cleanup_all() in error handling code, and rbi->skb or rbi->page
are freed again in vmxnet3_rq_cleanup_all(), causing use-after-free bugs.
To fix these possible bugs, rbi->skb and rbi->page should be cleared after
they are freed.
The error log in our fault-injection testing is shown as follows:
Xin Long [Mon, 16 May 2022 01:37:27 +0000 (21:37 -0400)]
xfrm: set dst dev to blackhole_netdev instead of loopback_dev in ifdown
The global blackhole_netdev has replaced pernet loopback_dev to become the
one given to the object that holds an netdev when ifdown in many places of
ipv4 and ipv6 since commit 45f1fa8c0e5b ("blackhole_netdev: use
blackhole_netdev to invalidate dst entries").
Especially after commit 9c0e24941de2 ("net: allow out-of-order netdev
unregistration"), it's no longer safe to use loopback_dev that may be
freed before other netdev.
This patch is to set dst dev to blackhole_netdev instead of loopback_dev
in ifdown.
Horatiu Vultur [Fri, 13 May 2022 18:00:30 +0000 (20:00 +0200)]
net: lan966x: Fix assignment of the MAC address
The following two scenarios were failing for lan966x.
1. If the port had the address X and then trying to assign the same
address, then the HW was just removing this address because first it
tries to learn new address and then delete the old one. As they are
the same the HW remove it.
2. If the port eth0 was assigned the same address as one of the other
ports eth1 then when assigning back the address to eth0 then the HW
was deleting the address of eth1.
The case 1. is fixed by checking if the port has already the same
address while case 2. is fixed by checking if the address is used by any
other port.
Jonathan Lemon [Fri, 13 May 2022 22:52:31 +0000 (15:52 -0700)]
ptp: ocp: have adjtime handle negative delta_ns correctly
delta_ns is a s64, but it was being passed ptp_ocp_adjtime_coarse
as an u64. Also, it turns out that timespec64_add_ns() only handles
positive values, so perform the math with set_normalized_timespec().
Fixes: d6c5da8943dc ("ptp: ocp: Add ptp_ocp_adjtime_coarse for large adjustments") Suggested-by: Vadim Fedorenko <vfedorenko@novek.ru> Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Acked-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/20220513225231.1412-1-jonathan.lemon@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Felix Fietkau [Mon, 9 May 2022 12:26:16 +0000 (14:26 +0200)]
netfilter: nft_flow_offload: fix offload with pppoe + vlan
When running a combination of PPPoE on top of a VLAN, we need to set
info->outdev to the PPPoE device, otherwise PPPoE encap is skipped
during software offload.
Fixes: 83061518236c ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Felix Fietkau [Mon, 9 May 2022 12:26:15 +0000 (14:26 +0200)]
net: fix dev_fill_forward_path with pppoe + bridge
When calling dev_fill_forward_path on a pppoe device, the provided destination
address is invalid. In order for the bridge fdb lookup to succeed, the pppoe
code needs to update ctx->daddr to the correct value.
Fix this by storing the address inside struct net_device_path_ctx
Fixes: 4911b3702b67 ("net: ppp: resolve forwarding path for bridge pppoe devices") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Felix Fietkau [Mon, 9 May 2022 12:26:14 +0000 (14:26 +0200)]
netfilter: nft_flow_offload: skip dst neigh lookup for ppp devices
The dst entry does not contain a valid hardware address, so skip the lookup
in order to avoid running into errors here.
The proper hardware address is filled in from nft_dev_path_info
Fixes: 83061518236c ("netfilter: flowtable: add pppoe support") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Felix Fietkau [Mon, 9 May 2022 12:26:13 +0000 (14:26 +0200)]
netfilter: flowtable: fix excessive hw offload attempts after failure
If a flow cannot be offloaded, the code currently repeatedly tries again as
quickly as possible, which can significantly increase system load.
Fix this by limiting flow timeout update and hardware offload retry to once
per second.
Fixes: cad1732afa56 ("netfilter: flowtable: Remove redundant hw refresh bit") Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
The 'shift' field is not validated, and any value above 31 will
trigger out-of-bounds. The issue predates the git history, but
syzbot was able to trigger it only after the commit mentioned in
the fixes tag, and this change only applies on top of such commit.
Address the issue bounding the 'shift' value to the maximum allowed
by the relevant operator.
Reported-and-tested-by: syzbot+8ed8fc4c57e9dcf23ca6@syzkaller.appspotmail.com Fixes: 5099933915e2 ("net/sched: act_pedit: really ensure the skb is writable") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 16 May 2022 08:39:37 +0000 (09:39 +0100)]
Merge tag 'linux-can-fixes-for-5.18-20220514' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can
Marc Kleine-Budde says:
====================
pull-request: can 2022-05-14
this is a pull request of 2 patches for net/master.
Changes to linux-can-fixes-for-5.18-20220513:
- adjusted Fixes: Tag on "Revert "can: m_can: pci: use custom bit timings for Elkhart Lake""
(Thanks Jakub)
Both patches are by Jarkko Nikula, target the m_can PCI driver
bindings, and fix usage of wrong bit timing constants for the Elkhart
Lake platform.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eyal Birger [Fri, 13 May 2022 20:34:02 +0000 (23:34 +0300)]
xfrm: fix "disable_policy" flag use when arriving from different devices
In IPv4 setting the "disable_policy" flag on a device means no policy
should be enforced for traffic originating from the device. This was
implemented by seting the DST_NOPOLICY flag in the dst based on the
originating device.
However, dsts are cached in nexthops regardless of the originating
devices, in which case, the DST_NOPOLICY flag value may be incorrect.
dst entries for traffic arriving from eth0 would have DST_NOPOLICY
and would be cached and therefore can be reused by traffic originating
from ipsec0, skipping policy check.
Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead
of the DST in IN/FWD IPv4 policy checks.
Jarkko Nikula [Thu, 12 May 2022 12:41:44 +0000 (15:41 +0300)]
can: m_can: remove support for custom bit timing, take #2
Now when Intel Elkhart Lake uses again common bit timing and there are
no other users for custom bit timing, we can bring back the changes
done by the commit 643a5b2ccbb3 ("can: m_can: remove support for
custom bit timing").
This effectively reverts commit 3910ecf0cac3 ("Revert "can: m_can:
remove support for custom bit timing"") while taking into account
commit 4864beedda44 ("can: m_can: make custom bittiming fields const")
and commit 6907f688412a ("can: dev: add sanity check in
can_set_static_ctrlmode()").
Commit 09840e7a5016 ("can: m_can: pci: use custom bit timings for
Elkhart Lake") broke the test case using bitrate switching.
| ip link set can0 up type can bitrate 500000 dbitrate 4000000 fd on
| ip link set can1 up type can bitrate 500000 dbitrate 4000000 fd on
| candump can0 &
| cangen can1 -I 0x800 -L 64 -e -fb \
| -D 11223344deadbeef55667788feedf00daabbccdd44332211 -n 1 -v -v
Above commit does everything correctly according to the datasheet.
However datasheet wasn't correct.
I got confirmation from hardware engineers that the actual CAN
hardware on Intel Elkhart Lake is based on M_CAN version v3.2.0.
Datasheet was mirroring values from an another specification which was
based on earlier M_CAN version leading to wrong bit timings.
Therefore revert the commit and switch back to common bit timings.
Fixes: d5609725ba23 ("can: m_can: pci: use custom bit timings for Elkhart Lake") Link: https://lore.kernel.org/all/20220512124144.536850-1-jarkko.nikula@linux.intel.com Signed-off-by: Jarkko Nikula <jarkko.nikula@linux.intel.com> Reported-by: Chee Hou Ong <chee.houx.ong@intel.com> Reported-by: Aman Kumar <aman.kumar@intel.com> Reported-by: Pallavi Kumari <kumari.pallavi@intel.com> Cc: <stable@vger.kernel.org> # v5.16+ Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Harini Katakam [Thu, 12 May 2022 17:19:00 +0000 (22:49 +0530)]
net: macb: Increment rx bd head after allocating skb and buffer
In gem_rx_refill rx_prepared_head is incremented at the beginning of
the while loop preparing the skb and data buffers. If the skb or data
buffer allocation fails, this BD will be unusable BDs until the head
loops back to the same BD (and obviously buffer allocation succeeds).
In the unlikely event that there's a string of allocation failures,
there will be an equal number of unusable BDs and an inconsistent RX
BD chain. Hence increment the head at the end of the while loop to be
clean.
This series contains a bug fix affecting the in-kernel path manager
(patch 1), where closing subflows would sometimes not adjust the PM's
count of active subflows. Patch 2 updates the selftests to exercise the
new code.
====================
Paolo Abeni [Thu, 12 May 2022 23:26:41 +0000 (16:26 -0700)]
mptcp: fix subflow accounting on close
If the PM closes a fully established MPJ subflow or the subflow
creation errors out in it's early stage the subflows counter is
not bumped accordingly.
This change adds the missing accounting, additionally taking care
of updating accordingly the 'accept_subflow' flag.
Fixes: 50839b6d97c8 ("mptcp: do not block subflows creation on errors") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
As Leonard pointed out, a newly allocated netns can happen
to reuse a freed 'struct net'.
While TCP TW timers were covered by my patches, other things were not:
1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look
at 4-tuple plus the 'struct net' pointer.
2) /proc/net/tcp[6] and inet_diag, same reason.
3) hashinfo->bhash[], same reason.
Fixing all this seems risky, lets instead revert.
In the future, we might have a per netns tcp hash table, or
a per netns list of timewait sockets...
Fixes: 8824bc0f9c93 ("tcp/dccp: get rid of inet_twsk_purge()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Leonard Crestez <cdleonard@gmail.com> Tested-by: Leonard Crestez <cdleonard@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Thu, 12 May 2022 15:10:33 +0000 (10:10 -0500)]
net: ipa: get rid of a duplicate initialization
In ipa_qmi_ready(), the "ipa" local variable is set when
initialized, but then set again just before it's first used.
One or the other is enough, so get rid of the first one.
References: https://lore.kernel.org/lkml/200de1bd-0f01-c334-ca18-43eed783dfac@intel.com/ Reported-by: kernel test robot <lkp@intel.com> Fixes: b5839de2348d ("soc: qcom: ipa: AP/modem communications") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Thu, 12 May 2022 15:10:32 +0000 (10:10 -0500)]
net: ipa: record proper RX transaction count
Each time we are notified that some number of transactions on an RX
channel has completed, we record the number of bytes that have been
transferred since the previous notification. We also track the
number of transactions completed, but that is not currently being
calculated correctly; we're currently counting the number of such
notifications, but each notification can represent many transaction
completions. Fix this.
Fixes: e00b9408001bc ("soc: qcom: ipa: the generic software interface") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Thu, 12 May 2022 15:10:31 +0000 (10:10 -0500)]
net: ipa: certain dropped packets aren't accounted for
If an RX endpoint receives packets containing status headers, and a
packet in the buffer is not dropped, ipa_endpoint_skb_copy() is
responsible for wrapping the packet data in an SKB and forwarding it
to ipa_modem_skb_rx() for further processing.
If ipa_endpoint_skb_copy() gets a null pointer from build_skb(), it
just returns early. But in the process it doesn't record that as a
dropped packet in the network device statistics.
Instead, call ipa_modem_skb_rx() whether or not the SKB pointer is
NULL; that function ensures the statistics are properly updated.
Fixes: e4f3afa03b23b ("net: ipa: skip SKB copy if no netdev") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
- wifi: iwlwifi: iwl-dbg: use del_timer_sync() before freeing
- wifi: ath11k: reduce the wait time of 11d scan and hw scan while
adding an interface
- mac80211: fix rx reordering with non explicit / psmp ack policy
- mac80211: reset MBSSID parameters upon connection
- nl80211: fix races in nl80211_set_tx_bitrate_mask()
- tls: fix context leak on tls_device_down
- sched: act_pedit: really ensure the skb is writable
- batman-adv: don't skb_split skbuffs with frag_list
- eth: ocelot: fix various issues with TC actions (null-deref; bad
stats; ineffective drops; ineffective filter removal)"
* tag 'net-5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (61 commits)
tls: Fix context leak on tls_device_down
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
mlxsw: Avoid warning during ip6gre device removal
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
Bluetooth: Fix the creation of hdev->name
i40e: i40e_main: fix a missing check on list iterator
net/sched: act_pedit: really ensure the skb is writable
s390/lcs: fix variable dereferenced before check
s390/ctcm: fix potential memory leak
s390/ctcm: fix variable dereferenced before check
net: atlantic: verify hw_head_ lies within TX buffer ring
net: atlantic: add check for MAX_SKB_FRAGS
net: atlantic: reduce scope of is_rsc_complete
net: atlantic: fix "frag[0] not initialized"
net: stmmac: fix missing pci_disable_device() on error in stmmac_pci_probe()
net: phy: micrel: Fix incorrect variable type in micrel
decnet: Use container_of() for struct dn_neigh casts
...
Linus Torvalds [Thu, 12 May 2022 17:42:56 +0000 (10:42 -0700)]
Merge branch 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup fix from Tejun Heo:
"Waiman's fix for a cgroup2 cpuset bug where it could miss nodes which
were hot-added"
* 'for-5.18-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup/cpuset: Remove cpus_allowed/mems_allowed setup in cpuset_init_smp()
Linus Torvalds [Thu, 12 May 2022 17:21:44 +0000 (10:21 -0700)]
Merge tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs
Pull fs fixes from Jan Kara:
"Three fixes that I'd still like to get to 5.18:
- add a missing sanity check in the fanotify FAN_RENAME feature
(added in 5.17, let's fix it before it gets wider usage in
userspace)
- udf fix for recently introduced filesystem corruption issue
- writeback fix for a race in inode list handling that can lead to
delayed writeback and possible dirty throttling stalls"
* tag 'fixes_for_v5.18-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
udf: Avoid using stale lengthOfImpUse
writeback: Avoid skipping inode writeback
fanotify: do not allow setting dirent events in mask of non-dir
The commit cited below claims to fix a use-after-free condition after
tls_device_down. Apparently, the description wasn't fully accurate. The
context stayed alive, but ctx->netdev became NULL, and the offload was
torn down without a proper fallback, so a bug was present, but a
different kind of bug.
Due to misunderstanding of the issue, the original patch dropped the
refcount_dec_and_test line for the context to avoid the alleged
premature deallocation. That line has to be restored, because it matches
the refcount_inc_not_zero from the same function, otherwise the contexts
that survived tls_device_down are leaked.
This patch fixes the described issue by restoring refcount_dec_and_test.
After this change, there is no leak anymore, and the fallback to
software kTLS still works.
Fixes: 28fc4d928033 ("net/tls: Fix use-after-free after the TLS device goes down and up") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20220512091830.678684-1-maximmi@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Taehee Yoo [Thu, 12 May 2022 05:47:09 +0000 (05:47 +0000)]
net: sfc: ef10: fix memory leak in efx_ef10_mtd_probe()
In the NIC ->probe() callback, ->mtd_probe() callback is called.
If NIC has 2 ports, ->probe() is called twice and ->mtd_probe() too.
In the ->mtd_probe(), which is efx_ef10_mtd_probe() it allocates and
initializes mtd partiion.
But mtd partition for sfc is shared data.
So that allocated mtd partition data from last called
efx_ef10_mtd_probe() will not be used.
Therefore it must be freed.
But it doesn't free a not used mtd partition data in efx_ef10_mtd_probe().
Guangguan Wang [Thu, 12 May 2022 03:08:20 +0000 (11:08 +0800)]
net/smc: non blocking recvmsg() return -EAGAIN when no data and signal_pending
Non blocking sendmsg will return -EAGAIN when any signal pending
and no send space left, while non blocking recvmsg return -EINTR
when signal pending and no data received. This may makes confused.
As TCP returns -EAGAIN in the conditions described above. Align the
behavior of smc with TCP.
Fixes: 45159abdee88 ("net/smc: add receive timeout check") Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Link: https://lore.kernel.org/r/20220512030820.73848-1-guangguan.wang@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian Fainelli [Thu, 12 May 2022 02:17:31 +0000 (19:17 -0700)]
net: dsa: bcm_sf2: Fix Wake-on-LAN with mac_link_down()
After commit 30273ea6eedd ("net: dsa/bcm_sf2: fix incorrect usage of
state->link") the interface suspend path would call our mac_link_down()
call back which would forcibly set the link down, thus preventing
Wake-on-LAN packets from reaching our management port.
Fix this by looking at whether the port is enabled for Wake-on-LAN and
not clearing the link status in that case to let packets go through.
Amit Cohen [Wed, 11 May 2022 11:57:47 +0000 (14:57 +0300)]
mlxsw: Avoid warning during ip6gre device removal
IPv6 addresses which are used for tunnels are stored in a hash table
with reference counting. When a new GRE tunnel is configured, the driver
is notified and configures it in hardware.
Currently, any change in the tunnel is not applied in the driver. It
means that if the remote address is changed, the driver is not aware of
this change and the first address will be used.
This behavior results in a warning [1] in scenarios such as the
following:
# ip link add name gre1 type ip6gre local 2000::3 remote 2000::fffe tos inherit ttl inherit
# ip link set name gre1 type ip6gre local 2000::3 remote 2000::ffff ttl inherit
# ip link delete gre1
The change of the address is not applied in the driver. Currently, the
driver uses the remote address which is stored in the 'parms' of the
overlay device. When the tunnel is removed, the new IPv6 address is
used, the driver tries to release it, but as it is not aware of the
change, this address is not configured and it warns about releasing non
existing IPv6 address.
Fix it by using the IPv6 address which is cached in the IPIP entry, this
address is the last one that the driver used, so even in cases such the
above, the first address will be released, without any warning.
Florian Fainelli [Wed, 11 May 2022 03:17:51 +0000 (20:17 -0700)]
net: bcmgenet: Check for Wake-on-LAN interrupt probe deferral
The interrupt controller supplying the Wake-on-LAN interrupt line maybe
modular on some platforms (irq-bcm7038-l1.c) and might be probed at a
later time than the GENET driver. We need to specifically check for
-EPROBE_DEFER and propagate that error to ensure that we eventually
fetch the interrupt descriptor.
Fixes: 06c4e54d52fa ("bcmgenet: add WOL IRQ check") Fixes: c528170c7efb ("net: bcmgenet: Avoid touching non-existent interrupt") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Stefan Wahren <stefan.wahren@i2se.com> Link: https://lore.kernel.org/r/20220511031752.2245566-1-f.fainelli@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Yang Yingliang [Wed, 11 May 2022 03:08:29 +0000 (11:08 +0800)]
net: ethernet: mediatek: ppe: fix wrong size passed to memset()
'foe_table' is a pointer, the real size of struct mtk_foe_entry
should be pass to memset().
Fixes: 134df30a2f33 ("net: ethernet: mtk_eth_soc: add support for initializing the PPE") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20220511030829.3308094-1-yangyingliang@huawei.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Jakub Kicinski [Thu, 12 May 2022 00:40:39 +0000 (17:40 -0700)]
Merge tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix the creation of hdev->name when index is greater than 9999
* tag 'for-net-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: Fix the creation of hdev->name
====================
Jakub Kicinski [Thu, 12 May 2022 00:33:01 +0000 (17:33 -0700)]
Merge tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v5.18
Second set of fixes for v5.18 and hopefully the last one. We have a
new iwlwifi maintainer, a fix to rfkill ioctl interface and important
fixes to both stack and two drivers.
* tag 'wireless-2022-05-11' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
rfkill: uapi: fix RFKILL_IOCTL_MAX_SIZE ioctl request definition
nl80211: fix locking in nl80211_set_tx_bitrate_mask()
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
mac80211_hwsim: fix RCU protected chanctx access
mailmap: update Kalle Valo's email
mac80211: Reset MBSSID parameters upon connection
cfg80211: retrieve S1G operating channel number
nl80211: validate S1G channel width
mac80211: fix rx reordering with non explicit / psmp ack policy
ath11k: reduce the wait time of 11d scan and hw scan while add interface
MAINTAINERS: update iwlwifi driver maintainer
iwlwifi: iwl-dbg: Use del_timer_sync() before freeing
====================
Itay Iellin [Sat, 7 May 2022 12:32:48 +0000 (08:32 -0400)]
Bluetooth: Fix the creation of hdev->name
Set a size limit of 8 bytes of the written buffer to "hdev->name"
including the terminating null byte, as the size of "hdev->name" is 8
bytes. If an id value which is greater than 9999 is allocated,
then the "snprintf(hdev->name, sizeof(hdev->name), "hci%d", id)"
function call would lead to a truncation of the id value in decimal
notation.
Set an explicit maximum id parameter in the id allocation function call.
The id allocation function defines the maximum allocated id value as the
maximum id parameter value minus one. Therefore, HCI_MAX_ID is defined
as 10000.
Signed-off-by: Itay Iellin <ieitayie@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Xiaomeng Tong [Tue, 10 May 2022 20:48:46 +0000 (13:48 -0700)]
i40e: i40e_main: fix a missing check on list iterator
The bug is here:
ret = i40e_add_macvlan_filter(hw, ch->seid, vdev->dev_addr, &aq_err);
The list iterator 'ch' will point to a bogus position containing
HEAD if the list is empty or no element is found. This case must
be checked before any use of the iterator, otherwise it will
lead to a invalid memory access.
To fix this bug, use a new variable 'iter' as the list iterator,
while use the origin variable 'ch' as a dedicated pointer to
point to the found element.
Cc: stable@vger.kernel.org Fixes: 7b6ff2ed0a5e9 ("i40e: Add macvlan support on i40e") Signed-off-by: Xiaomeng Tong <xiam0nd.tong@gmail.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220510204846.2166999-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Paolo Abeni [Tue, 10 May 2022 14:57:34 +0000 (16:57 +0200)]
net/sched: act_pedit: really ensure the skb is writable
Currently pedit tries to ensure that the accessed skb offset
is writable via skb_unclone(). The action potentially allows
touching any skb bytes, so it may end-up modifying shared data.
The above causes some sporadic MPTCP self-test failures, due to
this code:
The above modifies a data byte outside the skb head and the skb is
a cloned one, carrying a TCP output packet.
This change addresses the issue by keeping track of a rough
over-estimate highest skb offset accessed by the action and ensuring
such offset is really writable.
Note that this may cause performance regressions in some scenarios,
but hopefully pedit is not in the critical path.
Fixes: e89869832520 ("act_pedit: access skb->data safely") Acked-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://lore.kernel.org/r/1fcf78e6679d0a287dd61bb0f04730ce33b3255d.1652194627.git.pabeni@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
David S. Miller [Wed, 11 May 2022 11:31:01 +0000 (12:31 +0100)]
Merge branch 's390-net-fixes'
Alexandra Winter says:
====================
s390/net: Cleanup some code checker findings
clean up smatch findings in legacy code. I was not able to provoke
any real failures on my systems, but other hardware reactions,
timing conditions or compiler output, may cause failures.
There are still 2 smatch warnings left in s390/net:
drivers/s390/net/ctcm_main.c:1326 add_channel() warn: missing error code 'rc'
This one is a false positive.
drivers/s390/net/netiucv.c:1355 netiucv_check_user() warn: argument 3 to %02x specifier has type 'char'
Postponing this one, need to better understand string handling in iucv.
There are several sparse warnings left in ctcm, like:
drivers/s390/net/ctcm_fsms.c:573:9: warning: context imbalance in 'ctcm_chx_setmode' - different lock contexts for basic block
Those are mentioned in the source, no plan to rework.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandra Winter [Tue, 10 May 2022 07:05:08 +0000 (09:05 +0200)]
s390/lcs: fix variable dereferenced before check
smatch complains about
drivers/s390/net/lcs.c:1741 lcs_get_control() warn: variable dereferenced before check 'card->dev' (see line 1739)
Fixes: 4852da22b200 ("[PATCH] s390: lcs driver bug fixes and improvements [1/2]") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandra Winter [Tue, 10 May 2022 07:05:07 +0000 (09:05 +0200)]
s390/ctcm: fix potential memory leak
smatch complains about
drivers/s390/net/ctcm_mpc.c:1210 ctcmpc_unpack_skb() warn: possible memory leak of 'mpcginfo'
mpc_action_discontact() did not free mpcginfo. Consolidate the freeing in
ctcmpc_unpack_skb().
Fixes: cb277cb3733b ("ctcm: infrastructure for replaced ctc driver") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alexandra Winter [Tue, 10 May 2022 07:05:06 +0000 (09:05 +0200)]
s390/ctcm: fix variable dereferenced before check
Found by cppcheck and smatch.
smatch complains about
drivers/s390/net/ctcm_sysfs.c:43 ctcm_buffer_write() warn: variable dereferenced before check 'priv' (see line 42)
Fixes: caa8c2f49b23 ("ctcm: rename READ/WRITE defines to avoid redefinitions") Reported-by: Colin Ian King <colin.i.king@gmail.com> Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 11 May 2022 11:25:07 +0000 (12:25 +0100)]
Merge branch 'atlantic-fixes'
Grant Grundler says:
====================
net: atlantic: more fuzzing fixes
It essentially describes four problems:
1) validate rxd_wb->next_desc_ptr before populating buff->next
2) "frag[0] not initialized" case in aq_ring_rx_clean()
3) limit iterations handling fragments in aq_ring_rx_clean()
4) validate hw_head_ in hw_atl_b0_hw_ring_tx_head_update()
(1) was fixed by Zekun Shen <bruceshenzk@gmail.com> around the same time with
"atlantic: Fix buff_ring OOB in aq_ring_rx_clean" (SHA1 bb54d5b0074e3925).
I've added one "clean up" contribution:
"net: atlantic: reduce scope of is_rsc_complete"
I tested the "original" patches using chromeos-v5.4 kernel branch:
https://chromium-review.googlesource.com/q/hashtag:pcinet-atlantic-2022q1+(status:open%20OR%20status:merged)
I've forward ported those patches to 5.18-rc2 and compiled them but am
unable to test them on 5.18-rc2 kernel (logistics problems).
Credit largely goes to ChromeOS Fuzzing team members:
Aashay Shringarpure, Yi Chou, Shervin Oloumi
V2 changes:
o drop first patch - was already fixed upstream differently
o reduce (4) "validate hw_head_" to simple bounds checking.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Grant Grundler [Tue, 10 May 2022 02:28:26 +0000 (19:28 -0700)]
net: atlantic: verify hw_head_ lies within TX buffer ring
Bounds check hw_head index provided by NIC to verify it lies
within the TX buffer ring.
Reported-by: Aashay Shringarpure <aashay@google.com> Reported-by: Yi Chou <yich@google.com> Reported-by: Shervin Oloumi <enlightened@google.com> Signed-off-by: Grant Grundler <grundler@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Grant Grundler [Tue, 10 May 2022 02:28:25 +0000 (19:28 -0700)]
net: atlantic: add check for MAX_SKB_FRAGS
Enforce that the CPU can not get stuck in an infinite loop.
Reported-by: Aashay Shringarpure <aashay@google.com> Reported-by: Yi Chou <yich@google.com> Reported-by: Shervin Oloumi <enlightened@google.com> Signed-off-by: Grant Grundler <grundler@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Grant Grundler [Tue, 10 May 2022 02:28:23 +0000 (19:28 -0700)]
net: atlantic: fix "frag[0] not initialized"
In aq_ring_rx_clean(), if buff->is_eop is not set AND
buff->len < AQ_CFG_RX_HDR_SIZE, then hdr_len remains equal to
buff->len and skb_add_rx_frag(xxx, *0*, ...) is not called.
The loop following this code starts calling skb_add_rx_frag() starting
with i=1 and thus frag[0] is never initialized. Since i is initialized
to zero at the top of the primary loop, we can just reference and
post-increment i instead of hardcoding the 0 when calling
skb_add_rx_frag() the first time.
Reported-by: Aashay Shringarpure <aashay@google.com> Reported-by: Yi Chou <yich@google.com> Reported-by: Shervin Oloumi <enlightened@google.com> Signed-off-by: Grant Grundler <grundler@chromium.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Wan Jiabing [Mon, 9 May 2022 14:45:19 +0000 (22:45 +0800)]
net: phy: micrel: Fix incorrect variable type in micrel
In lanphy_read_page_reg, calling __phy_read() might return a negative
error code. Use 'int' to check the error code.
Fixes: 48718dbd2e78 ("net: phy: micrel: Add support for LAN8804 PHY") Signed-off-by: Wan Jiabing <wanjiabing@vivo.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220509144519.2343399-1-wanjiabing@vivo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jan Kara [Tue, 10 May 2022 10:36:04 +0000 (12:36 +0200)]
udf: Avoid using stale lengthOfImpUse
udf_write_fi() uses lengthOfImpUse of the entry it is writing to.
However this field has not yet been initialized so it either contains
completely bogus value or value from last directory entry at that place.
In either case this is wrong and can lead to filesystem corruption or
kernel crashes.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com> CC: stable@vger.kernel.org Fixes: 714251660b67 ("udf: Get rid of 0-length arrays in struct fileIdentDesc") Signed-off-by: Jan Kara <jack@suse.cz>
Jing Xia [Tue, 10 May 2022 02:35:14 +0000 (10:35 +0800)]
writeback: Avoid skipping inode writeback
We have run into an issue that a task gets stuck in
balance_dirty_pages_ratelimited() when perform I/O stress testing.
The reason we observed is that an I_DIRTY_PAGES inode with lots
of dirty pages is in b_dirty_time list and standard background
writeback cannot writeback the inode.
After studing the relevant code, the following scenario may lead
to the issue:
This patch moves the dirty inode to b_dirty list when the inode
currently is not queued in b_io or b_more_io list at the end of
writeback_single_inode.
Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Christoph Hellwig <hch@lst.de> CC: stable@vger.kernel.org Fixes: d4c16982eb77 ("vfs: add support for a lazytime mount option") Signed-off-by: Jing Xia <jing.xia@unisoc.com> Signed-off-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20220510023514.27399-1-jing.xia@unisoc.com
Manuel Ullmann [Wed, 4 May 2022 19:30:44 +0000 (21:30 +0200)]
net: atlantic: always deep reset on pm op, fixing up my null deref regression
The impact of this regression is the same for resume that I saw on
thaw: the kernel hangs and nothing except SysRq rebooting can be done.
Fixes regression in commit b75b2b7744d3 ("net: atlantic: invert deep
par in pm functions, preventing null derefs"), where I disabled deep
pm resets in suspend and resume, trying to make sense of the
atl_resume_common() deep parameter in the first place.
It turns out, that atlantic always has to deep reset on pm
operations. Even though I expected that and tested resume, I screwed
up by kexec-rebooting into an unpatched kernel, thus missing the
breakage.
This fixup obsoletes the deep parameter of atl_resume_common, but I
leave the cleanup for the maintainers to post to mainline.
Suspend and hibernation were successfully tested by the reporters.
Jakub Kicinski [Tue, 10 May 2022 01:16:46 +0000 (18:16 -0700)]
Merge tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge
Simon Wunderlich says:
====================
Here is a batman-adv bugfix:
- Don't skb_split skbuffs with frag_list, by Sven Eckelmann
* tag 'batadv-net-pullrequest-20220508' of git://git.open-mesh.org/linux-merge:
batman-adv: Don't skb_split skbuffs with frag_list
====================
Vladimir Oltean [Sat, 7 May 2022 13:45:50 +0000 (16:45 +0300)]
net: dsa: flush switchdev workqueue on bridge join error path
There is a race between switchdev_bridge_port_offload() and the
dsa_port_switchdev_sync_attrs() call right below it.
When switchdev_bridge_port_offload() finishes, FDB entries have been
replayed by the bridge, but are scheduled for deferred execution later.
However dsa_port_switchdev_sync_attrs -> dsa_port_can_apply_vlan_filtering()
may impose restrictions on the vlan_filtering attribute and refuse
offloading.
When this happens, the delayed FDB entries will dereference dp->bridge,
which is a NULL pointer because we have stopped the process of
offloading this bridge.
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
Workqueue: dsa_ordered dsa_slave_switchdev_event_work
pc : dsa_port_bridge_host_fdb_del+0x64/0x100
lr : dsa_slave_switchdev_event_work+0x130/0x1bc
Call trace:
dsa_port_bridge_host_fdb_del+0x64/0x100
dsa_slave_switchdev_event_work+0x130/0x1bc
process_one_work+0x294/0x670
worker_thread+0x80/0x460
---[ end trace 0000000000000000 ]---
Error: dsa_core: Must first remove VLAN uppers having VIDs also present in bridge.
Fix the bug by doing what we do on the normal bridge leave path as well,
which is to wait until the deferred FDB entries complete executing, then
exit.
The placement of dsa_flush_workqueue() after switchdev_bridge_port_unoffload()
guarantees that both the FDB additions and deletions on rollback are waited for.
Jakub Kicinski [Tue, 10 May 2022 00:54:26 +0000 (17:54 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-05-06
This series contains updates to ice driver only.
Ivan Vecera fixes a race with aux plug/unplug by delaying setting adev
until initialization is complete and adding locking.
Anatolii ensures VF queues are completely disabled before attempting to
reconfigure them.
Michal ensures stale Tx timestamps are cleared from hardware.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: fix PTP stale Tx timestamps cleanup
ice: clear stale Tx queue settings before configuring
ice: Fix race during aux device (un)plugging
====================
net: phy: Fix race condition on link status change
This fixes the following error caused by a race condition between
phydev->adjust_link() and a MDIO transaction in the phy interrupt
handler. The issue was reproduced with the ethernet FEC driver and a
micrel KSZ9031 phy.
[ 146.195696] fec 2188000.ethernet eth0: MDIO read timeout
[ 146.201779] ------------[ cut here ]------------
[ 146.206671] WARNING: CPU: 0 PID: 571 at drivers/net/phy/phy.c:942 phy_error+0x24/0x6c
[ 146.214744] Modules linked in: bnep imx_vdoa imx_sdma evbug
[ 146.220640] CPU: 0 PID: 571 Comm: irq/128-2188000 Not tainted 5.18.0-rc3-00080-gd569e86915b7 #9
[ 146.229563] Hardware name: Freescale i.MX6 Quad/DualLite (Device Tree)
[ 146.236257] unwind_backtrace from show_stack+0x10/0x14
[ 146.241640] show_stack from dump_stack_lvl+0x58/0x70
[ 146.246841] dump_stack_lvl from __warn+0xb4/0x24c
[ 146.251772] __warn from warn_slowpath_fmt+0x5c/0xd4
[ 146.256873] warn_slowpath_fmt from phy_error+0x24/0x6c
[ 146.262249] phy_error from kszphy_handle_interrupt+0x40/0x48
[ 146.268159] kszphy_handle_interrupt from irq_thread_fn+0x1c/0x78
[ 146.274417] irq_thread_fn from irq_thread+0xf0/0x1dc
[ 146.279605] irq_thread from kthread+0xe4/0x104
[ 146.284267] kthread from ret_from_fork+0x14/0x28
[ 146.289164] Exception stack(0xe6fa1fb0 to 0xe6fa1ff8)
[ 146.294448] 1fa0: 00000000000000000000000000000000
[ 146.302842] 1fc0: 0000000000000000000000000000000000000000000000000000000000000000
[ 146.311281] 1fe0: 000000000000000000000000000000000000001300000000
[ 146.318262] irq event stamp: 12325
[ 146.321780] hardirqs last enabled at (12333): [<c01984c4>] __up_console_sem+0x50/0x60
[ 146.330013] hardirqs last disabled at (12342): [<c01984b0>] __up_console_sem+0x3c/0x60
[ 146.338259] softirqs last enabled at (12324): [<c01017f0>] __do_softirq+0x2c0/0x624
[ 146.346311] softirqs last disabled at (12319): [<c01300ac>] __irq_exit_rcu+0x138/0x178
[ 146.354447] ---[ end trace 0000000000000000 ]---
With the FEC driver phydev->adjust_link() calls fec_enet_adjust_link()
calls fec_stop()/fec_restart() and both these function reset and
temporary disable the FEC disrupting any MII transaction that
could be happening at the same time.
fec_enet_adjust_link() and phy_read() can be running at the same time
when we have one additional interrupt before the phy_state_machine() is
able to terminate.
The W=2 build pointed out that the code wasn't initializing all the
variables in the dim_cq_moder declarations with the struct initializers.
The net change here is zero since these structs were already static
const globals and were initialized with zeros by the compiler, but
removing compiler warnings has value in and of itself.
lib/dim/net_dim.c: At top level:
lib/dim/net_dim.c:54:9: warning: missing initializer for field ‘comps’ of ‘const struct dim_cq_moder’ [-Wmissing-field-initializers]
54 | NET_DIM_RX_EQE_PROFILES,
| ^~~~~~~~~~~~~~~~~~~~~~~
In file included from lib/dim/net_dim.c:6:
./include/linux/dim.h:45:13: note: ‘comps’ declared here
45 | u16 comps;
| ^~~~~
and repeats for the tx struct, and once you fix the comps entry then
the cq_period_mode field needs the same treatment.
Use the commonly accepted style to indicate to the compiler that we
know what we're doing, and add a comma at the end of each struct
initializer to clean up the issue, and use explicit initializers
for the fields we are initializing which makes the compiler happy.
While here and fixing these lines, clean up the code slightly with
a fix for the super long lines by removing the word "_MODERATION" from a
couple defines only used in this file.
Jonathan Lemon [Fri, 6 May 2022 22:37:39 +0000 (15:37 -0700)]
ptp: ocp: Use DIV64_U64_ROUND_UP for rounding.
The initial code used roundup() to round the starting time to
a multiple of a period. This generated an error on 32-bit
systems, so was replaced with DIV_ROUND_UP_ULL().
However, this truncates to 32-bits on a 64-bit system. Replace
with DIV64_U64_ROUND_UP() instead.
Linus Torvalds [Mon, 9 May 2022 15:48:59 +0000 (08:48 -0700)]
Merge tag 'platform-drivers-x86-v5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
- thinkpad_acpi AMD suspend/resume + fan detection fixes
- two other small fixes
- one hardware-id addition
* tag 'platform-drivers-x86-v5.18-4' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/surface: aggregator: Fix initialization order when compiling as builtin module
platform/surface: gpe: Add support for Surface Pro 8
platform/x86/intel: Fix 'rmmod pmt_telemetry' panic
platform/x86: thinkpad_acpi: Correct dual fan probe
platform/x86: thinkpad_acpi: Add a s2idle resume quirk for a number of laptops
platform/x86: thinkpad_acpi: Convert btusb DMI list to quirks
The definition of RFKILL_IOCTL_MAX_SIZE introduced by commit 49ad564d7903 ("rfkill: make new event layout opt-in") is unusable
since it is based on RFKILL_IOC_EXT_SIZE which has not been defined.
Fix that by replacing the undefined constant with the constant which
is intended to be used in this definition.
Fixes: 49ad564d7903 ("rfkill: make new event layout opt-in") Cc: stable@vger.kernel.org # 5.11+ Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Signed-off-by: Dmitry V. Levin <ldv@altlinux.org> Link: https://lore.kernel.org/r/20220506172454.120319-1-glebfm@altlinux.org
[add commit message provided later by Dmitry] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Johannes Berg [Thu, 5 May 2022 21:04:22 +0000 (23:04 +0200)]
mac80211_hwsim: call ieee80211_tx_prepare_skb under RCU protection
This is needed since it might use (and pass out) pointers to
e.g. keys protected by RCU. Can't really happen here as the
frames aren't encrypted, but we need to still adhere to the
rules.