net/funeth: Fix fun_xdp_tx() and XDP packet reclaim
The current implementation of fun_xdp_tx(), used for XPD_TX, is
incorrect in that it takes an address/length pair and later releases it
with page_frag_free(). It is OK for XDP_TX but the same code is used by
ndo_xdp_xmit. In that case it loses the XDP memory type and releases the
packet incorrectly for some of the types. Assorted breakage follows.
Change fun_xdp_tx() to take xdp_frame and rely on xdp_return_frame() in
reclaim.
Jakub Kicinski [Thu, 28 Jul 2022 02:56:28 +0000 (19:56 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-07-26
This series contains updates to ice driver only.
Przemyslaw corrects accounting for VF VLANs to allow for correct number
of VLANs for untrusted VF. He also correct issue with checksum offload
on VXLAN tunnels.
Ani allows for two VSIs to share the same MAC address.
Maciej corrects checked bits for descriptor completion of loopback
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: do not setup vlan for loopback VSI
ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
ice: Fix VSIs unable to share unicast MAC
ice: Fix tunnel checksum offload with fragmented traffic
ice: Fix max VLANs available for VF
====================
This happens when calling sctp_sendmsg without connecting to server first.
In this case, a data chunk already queues up in send queue of client side
when processing the INIT_ACK from server in sctp_process_init() where it
calls sctp_stream_init() to alloc stream_in. If it fails to alloc stream_in
all stream_out will be freed in sctp_stream_init's err path. Then in the
asoc freeing it will crash when dequeuing this data chunk as stream_out
is missing.
As we can't free stream out before dequeuing all data from send queue, and
this patch is to fix it by moving the err path stream_out/in freeing in
sctp_stream_init() to sctp_stream_free() which is eventually called when
freeing the asoc in sctp_association_free(). This fix also makes the code
in sctp_process_init() more clear.
Note that in sctp_association_init() when it fails in sctp_stream_init(),
sctp_association_free() will not be called, and in that case it should
go to 'stream_free' err path to free stream instead of 'fail_init'.
Sending a PTP packet can imply to use the normal TX driver datapath but
invoked from the driver's ptp worker. The kernel generic TX code
disables softirqs and preemption before calling specific driver TX code,
but the ptp worker does not. Although current ptp driver functionality
does not require it, there are several reasons for doing so:
1) The invoked code is always executed with softirqs disabled for non
PTP packets.
2) Better if a ptp packet transmission is not interrupted by softirq
handling which could lead to high latencies.
3) netdev_xmit_more used by the TX code requires preemption to be
disabled.
Indeed a solution for dealing with kernel preemption state based on static
kernel configuration is not possible since the introduction of dynamic
preemption level configuration at boot time using the static calls
functionality.
Eric Dumazet [Tue, 26 Jul 2022 11:57:43 +0000 (11:57 +0000)]
tcp: md5: fix IPv4-mapped support
After the blamed commit, IPv4 SYN packets handled
by a dual stack IPv6 socket are dropped, even if
perfectly valid.
$ nstat | grep MD5
TcpExtTCPMD5Failure 5 0.0
For a dual stack listener, an incoming IPv4 SYN packet
would call tcp_inbound_md5_hash() with @family == AF_INET,
while tp->af_specific is pointing to tcp_sock_ipv6_specific.
Only later when an IPv4-mapped child is created, tp->af_specific
is changed to tcp_sock_ipv6_mapped_specific.
Fixes: fa64e214abdb ("net/tcp: Merge TCP-MD5 inbound callbacks") Reported-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Dmitry Safonov <dima@arista.com> Tested-by: Leonard Crestez <cdleonard@gmail.com> Link: https://lore.kernel.org/r/20220726115743.2759832-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jason Wang [Mon, 25 Jul 2022 07:21:59 +0000 (15:21 +0800)]
virtio-net: fix the race between refill work and close
We try using cancel_delayed_work_sync() to prevent the work from
enabling NAPI. This is insufficient since we don't disable the source
of the refill work scheduling. This means an NAPI poll callback after
cancel_delayed_work_sync() can schedule the refill work then can
re-enable the NAPI that leads to use-after-free [1].
Since the work can enable NAPI, we can't simply disable NAPI before
calling cancel_delayed_work_sync(). So fix this by introducing a
dedicated boolean to control whether or not the work could be
scheduled from NAPI.
[1]
==================================================================
BUG: KASAN: use-after-free in refill_work+0x43/0xd4
Read of size 2 at addr ffff88810562c92e by task kworker/2:1/42
Fixes: 5a39848b6859f ("virtio_net: set/cancel work on ndo_open/ndo_stop") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mat Martineau [Mon, 25 Jul 2022 20:52:31 +0000 (13:52 -0700)]
mptcp: Do not return EINPROGRESS when subflow creation succeeds
New subflows are created within the kernel using O_NONBLOCK, so
EINPROGRESS is the expected return value from kernel_connect().
__mptcp_subflow_connect() has the correct logic to consider EINPROGRESS
to be a successful case, but it has also used that error code as its
return value.
Before v5.19 this was benign: all the callers ignored the return
value. Starting in v5.19 there is a MPTCP_PM_CMD_SUBFLOW_CREATE generic
netlink command that does use the return value, so the EINPROGRESS gets
propagated to userspace.
Make __mptcp_subflow_connect() always return 0 on success instead.
Fixes: 8e381f2c7ea7 ("mptcp: Add handling of outgoing MP_JOIN requests") Fixes: 33ecd4cecca3 ("mptcp: netlink: allow userspace-driven subflow establishment") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20220725205231.87529-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) If nf_queue user requests packet truncation below size of l3 header,
we corrupt the skb, then crash. Reject such requests.
2) add cond_resched() calls when doing cycle detection in the
nf_tables graph. This avoids softlockup warning with certain
rulesets.
3) Reject rulesets that use nftables 'queue' expression in family/chain
combinations other than those that are supported. Currently the ruleset
will load, but when userspace attempts to reinject you get WARN splat +
packet drops.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_queue: only allow supported familes and hooks
netfilter: nf_tables: add rescheduling points during loop detection walks
netfilter: nf_queue: do not allow packet truncation below transport header offset
====================
Jakub Kicinski [Wed, 27 Jul 2022 02:48:24 +0000 (19:48 -0700)]
Merge tag 'for-net-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix early wakeup after suspend
- Fix double free on error
- Fix use-after-free on l2cap_chan_put
* tag 'for-net-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
Bluetooth: Always set event mask on suspend
Bluetooth: mgmt: Fix double free on error path
====================
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
This fixes the following trace which is caused by hci_rx_work starting up
*after* the final channel reference has been put() during sock_close() but
*before* the references to the channel have been destroyed, so instead
the code now rely on kref_get_unless_zero/l2cap_chan_hold_unless_zero to
prevent referencing a channel that is about to be destroyed.
refcount_t: increment on 0; use-after-free.
BUG: KASAN: use-after-free in refcount_dec_and_test+0x20/0xd0
Read of size 4 at addr ffffffc114f5bf18 by task kworker/u17:14/705
CPU: 4 PID: 705 Comm: kworker/u17:14 Tainted: G S W 4.14.234-00003-g1fb6d0bd49a4-dirty #28
Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150
Google Inc. MSM sm8150 Flame DVT (DT)
Workqueue: hci0 hci_rx_work
Call trace:
dump_backtrace+0x0/0x378
show_stack+0x20/0x2c
dump_stack+0x124/0x148
print_address_description+0x80/0x2e8
__kasan_report+0x168/0x188
kasan_report+0x10/0x18
__asan_load4+0x84/0x8c
refcount_dec_and_test+0x20/0xd0
l2cap_chan_put+0x48/0x12c
l2cap_recv_frame+0x4770/0x6550
l2cap_recv_acldata+0x44c/0x7a4
hci_acldata_packet+0x100/0x188
hci_rx_work+0x178/0x23c
process_one_work+0x35c/0x95c
worker_thread+0x4cc/0x960
kthread+0x1a8/0x1c4
ret_from_fork+0x10/0x18
Cc: stable@kernel.org Reported-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Tested-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
When suspending, always set the event mask once disconnects are
successful. Otherwise, if wakeup is disallowed, the event mask is not
set before suspend continues and can result in an early wakeup.
Fixes: b7ac669c9a6b ("Bluetooth: hci_sync: Rework hci_suspend_notifier") Cc: stable@vger.kernel.org Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop()
lockdep complains use of uninitialized spinlock at ieee80211_do_stop() [1],
for commit d6ce4edfa869a6a6 ("wifi: mac80211: do not wake queues on a vif
that is being stopped") guards clear_bit() using fq.lock even before
fq_init() from ieee80211_txq_setup_flows() initializes this spinlock.
According to discussion [2], Toke was not happy with expanding usage of
fq.lock. Since __ieee80211_wake_txqs() is called under RCU read lock, we
can instead use synchronize_rcu() for flushing ieee80211_wake_txqs().
Currently loopback test is failiing due to the error returned from
ice_vsi_vlan_setup(). Skip calling it when preparing loopback VSI.
Fixes: 9616d5b28366 ("ice: Add handler for ethtool selftest") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
Tx side sets EOP and RS bits on descriptors to indicate that a
particular descriptor is the last one and needs to generate an irq when
it was sent. These bits should not be checked on completion path
regardless whether it's the Tx or the Rx. DD bit serves this purpose and
it indicates that a particular descriptor is either for Rx or was
successfully Txed. EOF is also set as loopback test does not xmit
fragmented frames.
Look at (DD | EOF) bits setting in ice_lbtest_receive_frames() instead
of EOP and RS pair.
Fixes: 9616d5b28366 ("ice: Add handler for ethtool selftest") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
The driver currently does not allow two VSIs in the same PF domain
to have the same unicast MAC address. This is incorrect in the sense
that a policy decision is being made in the driver when it must be
left to the user. This approach was causing issues when rebooting
the system with VFs spawned not being able to change their MAC addresses.
Such errors were present in dmesg:
[ 7921.068237] ice 0000:b6:00.2 ens2f2: Unicast MAC 6a:0d:e4:70:ca:d1 already
exists on this PF. Preventing setting VF 7 unicast MAC address to 6a:0d:e4:70:ca:d1
Fix that by removing this restriction. Doing this also allows
us to remove some additional code that's checking if a unicast MAC
filter already exists.
Fixes: 2f12d696cbab ("ice: Check if unicast MAC exists before setting VF MAC") Signed-off-by: Anirudh Venkataramanan <anirudh.venkataramanan@intel.com> Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
ice: Fix tunnel checksum offload with fragmented traffic
Fix checksum offload on VXLAN tunnels.
In case, when mpls protocol is not used, set l4 header to transport
header of skb. This fixes case, when user tries to offload checksums
of VXLAN tunneled traffic.
Steps for reproduction (requires link partner with tunnels):
ip l s enp130s0f0 up
ip a f enp130s0f0
ip a a 10.10.110.2/24 dev enp130s0f0
ip l s enp130s0f0 mtu 1600
ip link add vxlan12_sut type vxlan id 12 group 238.168.100.100 dev enp130s0f0 dstport 4789
ip l s vxlan12_sut up
ip a a 20.10.110.2/24 dev vxlan12_sut
iperf3 -c 20.10.110.1 #should connect
Offload params: td_offset, cd_tunnel_params were
corrupted, due to l4 header pointing wrong address. NIC would then drop
those packets internally, due to incorrect TX descriptor data,
which increased GLV_TEPC register.
Fixes: 3bd8ff3a0608 ("ice: Add mpls+tso support") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com> Tested-by: Gurucharan <gurucharanx.g@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Legacy VLAN implementation allows for untrusted VF to have 8 VLAN
filters, not counting VLAN 0 filters. Current VLAN_V2 implementation
lowers available filters for VF, by counting in VLAN 0 filter for both
TPIDs.
Fix this by counting only non zero VLAN filters.
Without this patch, untrusted VF would not be able to access 8 VLAN
filters.
Fixes: d81279357b19 ("ice: Add support for VIRTCHNL_VF_OFFLOAD_VLAN_V2") Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com> Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com> Tested-by: Marek Szlosek <marek.szlosek@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Benjamin Poirier [Mon, 25 Jul 2022 00:12:36 +0000 (09:12 +0900)]
bridge: Do not send empty IFLA_AF_SPEC attribute
After commit cb8f2435c6ed ("bridge: Netlink interface fix."),
br_fill_ifinfo() started to send an empty IFLA_AF_SPEC attribute when a
bridge vlan dump is requested but an interface does not have any vlans
configured.
iproute2 ignores such an empty attribute since commit b262a9becbcb
("bridge: Fix output with empty vlan lists") but older iproute2 versions as
well as other utilities have their output changed by the cited kernel
commit, resulting in failed test cases. Regardless, emitting an empty
attribute is pointless and inefficient.
Avoid this change by canceling the attribute if no AF_SPEC data was added.
Fixes: cb8f2435c6ed ("bridge: Netlink interface fix.") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220725001236.95062-1-bpoirier@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Paolo Abeni [Tue, 26 Jul 2022 11:05:46 +0000 (13:05 +0200)]
Merge branch 'octeontx2-minor-tc-fixes'
Subbaraya Sundeep says:
====================
Octeontx2 minor tc fixes
This patch set fixes two problems found in tc code
wrt to ratelimiting and when installing UDP/TCP filters.
Patch 1: CN10K has different register format compared to
CN9xx hence fixes that.
Patch 2: Check flow mask also before installing a src/dst
port filter, otherwise installing for one port installs for other one too.
====================
octeontx2-pf: Fix UDP/TCP src and dst port tc filters
Check the mask for non-zero value before installing tc filters
for L4 source and destination ports. Otherwise installing a
filter for source port installs destination port too and
vice-versa.
NIX_AF_TLXX_PIR/CIR register format has changed from OcteonTx2
to CN10K. CN10K supports larger burst size. Fix burst exponent
and burst mantissa configuration for CN10K.
Also fixed 'maxrate' from u32 to u64 since 'police.rate_bytes_ps'
passed by stack is also u64.
sctp: fix sleep in atomic context bug in timer handlers
There are sleep in atomic context bugs in timer handlers of sctp
such as sctp_generate_t3_rtx_event(), sctp_generate_probe_event(),
sctp_generate_t1_init_event(), sctp_generate_timeout_event(),
sctp_generate_t3_rtx_event() and so on.
The root cause is sctp_sched_prio_init_sid() with GFP_KERNEL parameter
that may sleep could be called by different timer handlers which is in
interrupt context.
One of the call paths that could trigger bug is shown below:
This patch changes gfp_t parameter of init_sid in sctp_sched_set_sched()
from GFP_KERNEL to GFP_ATOMIC in order to prevent sleep in atomic
context bugs.
Vladimir Oltean [Sat, 23 Jul 2022 01:24:11 +0000 (04:24 +0300)]
net: dsa: fix reference counting for LAG FDBs
Due to an invalid conflict resolution on my side while working on 2
different series (LAG FDBs and FDB isolation), dsa_switch_do_lag_fdb_add()
does not store the database associated with a dsa_mac_addr structure.
So after adding an FDB entry associated with a LAG, dsa_mac_addr_find()
fails to find it while deleting it, because &a->db is zeroized memory
for all stored FDB entries of lag->fdbs, and dsa_switch_do_lag_fdb_del()
returns -ENOENT rather than deleting the entry.
i40e: Fix interface init with MSI interrupts (no MSI-X)
Fix the inability to bring an interface up on a setup with
only MSI interrupts enabled (no MSI-X).
Solution is to add a default number of QPs = 1. This is enough,
since without MSI-X support driver enables only a basic feature set.
Fixes: a2d761c94ee9 ("i40e: Fix the number of queues available to be mapped for use") Signed-off-by: Dawid Lukwinski <dawid.lukwinski@intel.com> Signed-off-by: Michal Maloszewski <michal.maloszewski@intel.com> Tested-by: Dave Switzer <david.switzer@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://lore.kernel.org/r/20220722175401.112572-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
ipv4: Fix data-races around sysctl_fib_notify_on_flag_change.
While reading sysctl_fib_notify_on_flag_change, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1d4352c9675c ("net: ipv4: Emit notification when fib hardware flags are changed") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix data-races around sysctl_tcp_reflect_tos.
While reading sysctl_tcp_reflect_tos, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 6553988b8db0 ("tcp: reflect tos value received in SYN to the socket") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_comp_sack_nr.
While reading sysctl_tcp_comp_sack_nr, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 64de30e705b9 ("tcp: add tcp_comp_sack_nr sysctl") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_comp_sack_slack_ns.
While reading sysctl_tcp_comp_sack_slack_ns, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 828bb0ad5f68 ("tcp: add hrtimer slack to sack compression") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_comp_sack_delay_ns.
While reading sysctl_tcp_comp_sack_delay_ns, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 7364cd7c5be7 ("tcp: add tcp_comp_sack_delay_ns sysctl") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mld: fix reference count leak in mld_{query | report}_work()
mld_{query | report}_work() processes queued events.
If there are too many events in the queue, it re-queue a work.
And then, it returns without in6_dev_put().
But if queuing is failed, it should call in6_dev_put(), but it doesn't.
So, a reference count leak would occur.
Script to reproduce(by Hangbin Liu):
ip netns add ns1
ip netns add ns2
ip netns exec ns1 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip netns exec ns2 sysctl -w net.ipv6.conf.all.force_mld_version=1
ip -n ns1 link add veth0 type veth peer name veth0 netns ns2
ip -n ns1 link set veth0 up
ip -n ns2 link set veth0 up
for i in `seq 50`; do
for j in `seq 100`; do
ip -n ns1 addr add 2021:${i}::${j}/64 dev veth0
ip -n ns2 addr add 2022:${i}::${j}/64 dev veth0
done
done
modprobe -r veth
ip -a netns del
splat looks like:
unregister_netdevice: waiting for veth0 to become free. Usage count = 2
leaked reference.
ipv6_add_dev+0x324/0xec0
addrconf_notify+0x481/0xd10
raw_notifier_call_chain+0xe3/0x120
call_netdevice_notifiers+0x106/0x160
register_netdevice+0x114c/0x16b0
veth_newlink+0x48b/0xa50 [veth]
rtnl_newlink+0x11a2/0x1a40
rtnetlink_rcv_msg+0x63f/0xc00
netlink_rcv_skb+0x1df/0x3e0
netlink_unicast+0x5de/0x850
netlink_sendmsg+0x6c9/0xa90
____sys_sendmsg+0x76a/0x780
__sys_sendmsg+0x27c/0x340
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
Tested-by: Hangbin Liu <liuhangbin@gmail.com> Fixes: 76fa0d5bd59a ("mld: add new workqueues for process mld events") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jianglei Nie [Fri, 22 Jul 2022 09:29:02 +0000 (17:29 +0800)]
net: macsec: fix potential resource leak in macsec_add_rxsa() and macsec_add_txsa()
init_rx_sa() allocates relevant resource for rx_sa->stats and rx_sa->
key.tfm with alloc_percpu() and macsec_alloc_tfm(). When some error
occurs after init_rx_sa() is called in macsec_add_rxsa(), the function
released rx_sa with kfree() without releasing rx_sa->stats and rx_sa->
key.tfm, which will lead to a resource leak.
We should call macsec_rxsa_put() instead of kfree() to decrease the ref
count of rx_sa and release the relevant resource if the refcount is 0.
The same bug exists in macsec_add_txsa() for tx_sa as well. This patch
fixes the above two bugs.
Fixes: 37ce3272dda3 ("net: macsec: hardware offloading infrastructure") Signed-off-by: Jianglei Nie <niejianglei2021@163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 25 Jul 2022 10:49:25 +0000 (11:49 +0100)]
Merge branch 'macsec-config-issues'
Sabrina Dubroca says:
====================
macsec: fix config issues
The patch adding netlink support for XPN (commit bb178617ead6
("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)"))
introduced several issues, including a kernel panic reported at [1].
Reproducing those bugs with upstream iproute is limited, since iproute
doesn't currently support XPN. I'm also working on this.
Currently, MACSEC_SA_ATTR_PN is handled inconsistently, sometimes as a
u32, sometimes forced into a u64 without checking the actual length of
the attribute. Instead, we can use nla_get_u64 everywhere, which will
read up to 64 bits into a u64, capped by the actual length of the
attribute coming from userspace.
This fixes several issues:
- the check in validate_add_rxsa doesn't work with 32-bit attributes
- the checks in validate_add_txsa and validate_upd_sa incorrectly
reject X << 32 (with X != 0)
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
IEEE 802.1AEbw-2013 (section 10.7.8) specifies that the maximum value
of the replay window is 2^30-1, to help with recovery of the upper
bits of the PN.
To avoid leaving the existing macsec device in an inconsistent state
if this test fails during changelink, reuse the cleanup mechanism
introduced for HW offload. This wasn't needed until now because
macsec_changelink_common could not fail during changelink, as
modifying the cipher suite was not allowed.
Finally, this must happen after handling IFLA_MACSEC_CIPHER_SUITE so
that secy->xpn is set.
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
macsec: fix error message in macsec_add_rxsa and _txsa
The expected length is MACSEC_SALT_LEN, not MACSEC_SA_ATTR_SALT.
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit bb178617ead6 added a test on tb_sa[MACSEC_SA_ATTR_PN], but
nothing guarantees that it's not NULL at this point. The same code was
added to macsec_add_txsa, but there it's not a problem because
validate_add_txsa checks that the MACSEC_SA_ATTR_PN attribute is
present.
Note: it's not possible to reproduce with iproute, because iproute
doesn't allow creating an SA without specifying the PN.
Fixes: bb178617ead6 ("macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)") Link: https://bugzilla.kernel.org/show_bug.cgi?id=208315 Reported-by: Frantisek Sumsal <fsumsal@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Thu, 21 Jul 2022 14:35:46 +0000 (10:35 -0400)]
Documentation: fix sctp_wmem in ip-sysctl.rst
Since commit 18662ea47f2a ("sctp: implement memory accounting on tx path"),
SCTP has supported memory accounting on tx path where 'sctp_wmem' is used
by sk_wmem_schedule(). So we should fix the description for this option in
ip-sysctl.rst accordingly.
v1->v2:
- Improve the description as Marcelo suggested.
Fixes: 18662ea47f2a ("sctp: implement memory accounting on tx path") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/tls: Remove the context from the list in tls_device_down
tls_device_down takes a reference on all contexts it's going to move to
the degraded state (software fallback). If sk_destruct runs afterwards,
it can reduce the reference counter back to 1 and return early without
destroying the context. Then tls_device_down will release the reference
it took and call tls_device_free_ctx. However, the context will still
stay in tls_device_down_list forever. The list will contain an item,
memory for which is released, making a memory corruption possible.
Fix the above bug by properly removing the context from all lists before
any call to tls_device_free_ctx.
Fixes: bf5c2d483b13 ("tls: Fix context leak on tls_device_down") Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This to-be-reverted commit was meant to apply a stricter rule for the
stack to enter pingpong mode. However, the condition used to check for
interactive session "before(tp->lsndtime, icsk->icsk_ack.lrcvtime)" is
jiffy based and might be too coarse, which delays the stack entering
pingpong mode.
We revert this patch so that we no longer use the above condition to
determine interactive session, and also reduce pingpong threshold to 1.
Fixes: f8c8f5365892 ("tcp: change pingpong threshold to 3") Reported-by: LemmyHuang <hlm3280@163.com> Suggested-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220721204404.388396-1-weiwan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
While the if/then schemas mostly work, there's a few issues. The 'allOf'
schema will also be true if 'fixed-link' is not an array or object as a
false 'if' schema (without an 'else') will be true. In the array case
doesn't set the type (uint32-array) in the 'then' clause. In the node case,
'additionalProperties' is missing.
Rework the schema to use oneOf with each possible type.
Signed-off-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_invalid_ratelimit.
While reading sysctl_tcp_invalid_ratelimit, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 7b805d602346 ("tcp: helpers to mitigate ACK loops by rate-limiting out-of-window dupacks") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_min_rtt_wlen.
While reading sysctl_tcp_min_rtt_wlen, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 50bc59fc3444 ("tcp: track min RTT using windowed min-filter") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_tso_rtt_log.
While reading sysctl_tcp_tso_rtt_log, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 754647c28d8e ("tcp: adjust TSO packet sizes based on min_rtt") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_limit_output_bytes.
While reading sysctl_tcp_limit_output_bytes, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 6475c3971092 ("tcp: TCP Small Queues") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix data-races around sysctl_tcp_workaround_signed_windows.
While reading sysctl_tcp_workaround_signed_windows, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: c34a9eda0549 ("[TCP]: sysctl to allow TCP window > 32767 sans wscale") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix data-races around sysctl_tcp_no_ssthresh_metrics_save.
While reading sysctl_tcp_no_ssthresh_metrics_save, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 1cb407385d84 ("net-tcp: Disable TCP ssthresh metrics cache by default") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Wed, 20 Jul 2022 11:20:57 +0000 (14:20 +0300)]
net: pcs: xpcs: propagate xpcs_read error to xpcs_get_state_c37_sgmii
While phylink_pcs_ops :: pcs_get_state does return void, xpcs_get_state()
does check for a non-zero return code from xpcs_get_state_c37_sgmii()
and prints that as a message to the kernel log.
However, a non-zero return code from xpcs_read() is translated into
"return false" (i.e. zero as int) and the I/O error is therefore not
printed. Fix that.
- eth: iavf: fix handling of dummy receive descriptors
- eth: lan966x: fix issues with MAC table
- eth: stmmac: dwmac-mediatek: fix clock issue
Misc:
- dsa: update documentation"
* tag 'net-5.19-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (107 commits)
mlxsw: spectrum_router: Fix IPv4 nexthop gateway indication
net/sched: cls_api: Fix flow action initialization
tcp: Fix data-races around sysctl_tcp_max_reordering.
tcp: Fix a data-race around sysctl_tcp_abort_on_overflow.
tcp: Fix a data-race around sysctl_tcp_rfc1337.
tcp: Fix a data-race around sysctl_tcp_stdurg.
tcp: Fix a data-race around sysctl_tcp_retrans_collapse.
tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.
tcp: Fix data-races around sysctl_tcp_recovery.
tcp: Fix a data-race around sysctl_tcp_early_retrans.
tcp: Fix data-races around sysctl knobs related to SYN option.
udp: Fix a data-race around sysctl_udp_l3mdev_accept.
ip: Fix data-races around sysctl_ip_prot_sock.
ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.
ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.
ipv4: Fix a data-race around sysctl_fib_multipath_use_neigh.
can: rcar_canfd: Add missing of_node_put() in rcar_canfd_probe()
can: mcp251xfd: fix detection of mcp251863
Documentation: fix udp_wmem_min in ip-sysctl.rst
...
Peter Zijlstra [Fri, 8 Jul 2022 07:18:06 +0000 (09:18 +0200)]
mmu_gather: Force tlb-flush VM_PFNMAP vmas
Jann reported a race between munmap() and unmap_mapping_range(), where
unmap_mapping_range() will no-op once unmap_vmas() has unlinked the
VMA; however munmap() will not yet have invalidated the TLBs.
Therefore unmap_mapping_range() will complete while there are still
(stale) TLB entries for the specified range.
Mitigate this by force flushing TLBs for VM_PFNMAP ranges.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Fri, 8 Jul 2022 07:18:05 +0000 (09:18 +0200)]
mmu_gather: Let there be one tlb_{start,end}_vma() implementation
Now that architectures are no longer allowed to override
tlb_{start,end}_vma() re-arrange code so that there is only one
implementation for each of these functions.
This much simplifies trying to figure out what they actually do.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Fri, 8 Jul 2022 07:18:03 +0000 (09:18 +0200)]
mmu_gather: Remove per arch tlb_{start,end}_vma()
Scattered across the archs are 3 basic forms of tlb_{start,end}_vma().
Provide two new MMU_GATHER_knobs to enumerate them and remove the per
arch tlb_{start,end}_vma() implementations.
- MMU_GATHER_NO_FLUSH_CACHE indicates the arch has flush_cache_range()
but does *NOT* want to call it for each VMA.
- MMU_GATHER_MERGE_VMAS indicates the arch wants to merge the
invalidate across multiple VMAs if possible.
With these it is possible to capture the three forms:
1) empty stubs;
select MMU_GATHER_NO_FLUSH_CACHE and MMU_GATHER_MERGE_VMAS
Obviously, if the architecture does not have flush_cache_range() then
it also doesn't need to select MMU_GATHER_NO_FLUSH_CACHE.
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: Will Deacon <will@kernel.org> Cc: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the command 'lx-symbols' in gdb exits with the error`Function
"do_init_module" not defined in "kernel/module.c"`. This occurs because
the file kernel/module.c was moved to kernel/module/main.c.
Fix this breakage by changing the path to "kernel/module/main.c" in
LoadModuleBreakpoint.
Signed-off-by: Khalid Masum <khalid.masum.92@gmail.com> Acked-by: Luis Chamberlain <mcgrof@kernel.org> Fixes: 13c100fcaa1c ("module: Move all into module/") Reviewed-by: Douglas Anderson <dianders@chromium.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Sedat Dilek noticed that I had an extraneous semicolon at the end of a
line in the previous patch.
It's harmless, but unintentional, and while compilers just treat it as
an extra empty statement, for all I know some other tooling might warn
about it. So clean it up before other people notice too ;)
watchqueue: make sure to serialize 'wqueue->defunct' properly
When the pipe is closed, we mark the associated watchqueue defunct by
calling watch_queue_clear(). However, while that is protected by the
watchqueue lock, new watchqueue entries aren't actually added under that
lock at all: they use the pipe->rd_wait.lock instead, and looking up
that pipe happens without any locking.
The watchqueue code uses the RCU read-side section to make sure that the
wqueue entry itself hasn't disappeared, but that does not protect the
pipe_info in any way.
So make sure to actually hold the wqueue lock when posting watch events,
properly serializing against the pipe being torn down.
Eric Snowberg [Wed, 20 Jul 2022 16:40:27 +0000 (12:40 -0400)]
lockdown: Fix kexec lockdown bypass with ima policy
The lockdown LSM is primarily used in conjunction with UEFI Secure Boot.
This LSM may also be used on machines without UEFI. It can also be
enabled when UEFI Secure Boot is disabled. One of lockdown's features
is to prevent kexec from loading untrusted kernels. Lockdown can be
enabled through a bootparam or after the kernel has booted through
securityfs.
If IMA appraisal is used with the "ima_appraise=log" boot param,
lockdown can be defeated with kexec on any machine when Secure Boot is
disabled or unavailable. IMA prevents setting "ima_appraise=log" from
the boot param when Secure Boot is enabled, but this does not cover
cases where lockdown is used without Secure Boot.
To defeat lockdown, boot without Secure Boot and add ima_appraise=log to
the kernel command line; then:
mlxsw needs to distinguish nexthops with a gateway from connected
nexthops in order to write the former to the adjacency table of the
device. The check used to rely on the fact that nexthops with a gateway
have a 'link' scope whereas connected nexthops have a 'host' scope. This
is no longer correct after commit 7c0d80225cc9 ("ip: fix dflt addr
selection for connected nexthop").
Fix that by instead checking the address family of the gateway IP. This
is a more direct way and also consistent with the IPv6 counterpart in
mlxsw_sp_rt6_is_gateway().
Cc: stable@vger.kernel.org Fixes: 7c0d80225cc9 ("ip: fix dflt addr selection for connected nexthop") Fixes: 0b2837f003c1 ("nexthop: Add support for IPv4 nexthops") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Amit Cohen <amcohen@nvidia.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
The cited commit refactored the flow action initialization sequence to
use an interface method when translating tc action instances to flow
offload objects. The refactored version skips the initialization of the
generic flow action attributes for tc actions, such as pedit, that allocate
more than one offload entry. This can cause potential issues for drivers
mapping flow action ids.
Populate the generic flow action fields for all the flow action entries.
Fixes: 62e9a851648a ("flow_offload: add ops to tc_action_ops for flow action setup") Signed-off-by: Oz Shlomo <ozsh@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com>
----
v1 -> v2:
- coalese the generic flow action fields initialization to a single loop Reviewed-by: Baowen Zheng <baowen.zheng@corigine.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix data-races around sysctl_tcp_max_reordering.
While reading sysctl_tcp_max_reordering, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 22c07bfe20b7 ("tcp: allow for bigger reordering level") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix data-races around sysctl_tcp_slow_start_after_idle.
While reading sysctl_tcp_slow_start_after_idle, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 10e7863a3141 ("[TCP]: Add tcp_slow_start_after_idle sysctl.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_thin_linear_timeouts.
While reading sysctl_tcp_thin_linear_timeouts, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: 0abed800963b ("net: TCP thin linear timeouts") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
While reading sysctl_tcp_recovery, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its readers.
Fixes: 9c201ccfcd85 ("tcp: use RACK to detect losses") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: Fix a data-race around sysctl_tcp_early_retrans.
While reading sysctl_tcp_early_retrans, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 8ebd7bee9079 ("tcp: early retransmit") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
udp: Fix a data-race around sysctl_udp_l3mdev_accept.
While reading sysctl_udp_l3mdev_accept, it can be changed concurrently.
Thus, we need to add READ_ONCE() to its reader.
Fixes: 88c87d8d1629 ("net: Avoid receiving packets with an l3mdev on unbound UDP sockets") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
sysctl_ip_prot_sock is accessed concurrently, and there is always a chance
of data-race. So, all readers and writers need some basic protection to
avoid load/store-tearing.
Fixes: 6eb98b4f1505 ("Introduce a sysctl that modifies the value of PROT_SOCK.") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: Fix data-races around sysctl_fib_multipath_hash_fields.
While reading sysctl_fib_multipath_hash_fields, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: 0d4d245e00a2 ("ipv4: Add a sysctl to control multipath hash fields") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: Fix data-races around sysctl_fib_multipath_hash_policy.
While reading sysctl_fib_multipath_hash_policy, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its readers.
Fixes: cde8c3efa529 ("net: ipv4: add support for ECMP hash policy choice") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: Fix a data-race around sysctl_fib_multipath_use_neigh.
While reading sysctl_fib_multipath_use_neigh, it can be changed
concurrently. Thus, we need to add READ_ONCE() to its reader.
Fixes: f596ce6f4c9e ("net: ipv4: Consider failed nexthops in multipath routes") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>