Lee Jones [Thu, 2 Dec 2021 14:34:37 +0000 (14:34 +0000)]
net: cdc_ncm: Allow for dwNtbOutMaxSize to be unset or zero
Currently, due to the sequential use of min_t() and clamp_t() macros,
in cdc_ncm_check_tx_max(), if dwNtbOutMaxSize is not set, the logic
sets tx_max to 0. This is then used to allocate the data area of the
SKB requested later in cdc_ncm_fill_tx_frame().
This does not cause an issue presently because when memory is
allocated during initialisation phase of SKB creation, more memory
(512b) is allocated than is required for the SKB headers alone (320b),
leaving some space (512b - 320b = 192b) for CDC data (172b).
However, if more elements (for example 3 x u64 = [24b]) were added to
one of the SKB header structs, say 'struct skb_shared_info',
increasing its original size (320b [320b aligned]) to something larger
(344b [384b aligned]), then suddenly the CDC data (172b) no longer
fits in the spare SKB data area (512b - 384b = 128b).
Consequently the SKB bounds checking semantics fails and panics:
By overriding the max value with the default CDC_NCM_NTB_MAX_SIZE_TX
when not offered through the system provided params, we ensure enough
data space is allocated to handle the CDC data, meaning no crash will
occur.
Cc: Oliver Neukum <oliver@neukum.org> Fixes: 91c7a41ff6bce ("net: cdc_ncm: use sysfs for rx/tx aggregation tuning") Signed-off-by: Lee Jones <lee.jones@linaro.org> Reviewed-by: Bjørn Mork <bjorn@mork.no> Link: https://lore.kernel.org/r/20211202143437.1411410-1-lee.jones@linaro.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Manish Chopra [Fri, 3 Dec 2021 17:44:13 +0000 (09:44 -0800)]
qede: validate non LSO skb length
Although it is unlikely that stack could transmit a non LSO
skb with length > MTU, however in some cases or environment such
occurrences actually resulted into firmware asserts due to packet
length being greater than the max supported by the device (~9700B).
This patch adds the safeguard for such odd cases to avoid firmware
asserts.
v2: Added "Fixes" tag with one of the initial driver commit
which enabled the TX traffic actually (as this was probably
day1 issue which was discovered recently by some customer
environment)
Fixes: aaf077b4abdc ("qede: Add support for link") Signed-off-by: Manish Chopra <manishc@marvell.com> Signed-off-by: Alok Prasad <palok@marvell.com> Signed-off-by: Prabhakar Kushwaha <pkushwaha@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Link: https://lore.kernel.org/r/20211203174413.13090-1-manishc@marvell.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dan Carpenter [Fri, 3 Dec 2021 10:11:28 +0000 (13:11 +0300)]
net: altera: set a couple error code in probe()
There are two error paths which accidentally return success instead of
a negative error code.
Fixes: eaccad8b1d32 ("Altera TSE: Add main and header file for Altera Ethernet Driver") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Li Zhijian [Fri, 3 Dec 2021 02:32:13 +0000 (10:32 +0800)]
selftests: net/fcnal-test.sh: add exit code
Previously, the selftest framework always treats it as *ok* even though
some of them are failed actually. That's because the script always
returns 0.
It supports PASS/FAIL/SKIP exit code now.
CC: Philip Li <philip.li@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Li Zhijian <zhijianx.li@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
read to 0xffff888157e8ca24 of 4 bytes by task 1082 on cpu 1:
bond_alb_monitor+0x8f/0xc00 drivers/net/bonding/bond_alb.c:1511
process_one_work+0x3fc/0x980 kernel/workqueue.c:2298
worker_thread+0x616/0xa70 kernel/workqueue.c:2445
kthread+0x2c7/0x2e0 kernel/kthread.c:327
ret_from_fork+0x1f/0x30
value changed: 0x00000001 -> 0x00000064
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 1082 Comm: kworker/u4:3 Not tainted 5.16.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: bond1 bond_alb_monitor
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 2 Dec 2021 23:37:24 +0000 (15:37 -0800)]
tcp: fix another uninit-value (sk_rx_queue_mapping)
KMSAN is still not happy [1].
I missed that passive connections do not inherit their
sk_rx_queue_mapping values from the request socket,
but instead tcp_child_process() is calling
sk_mark_napi_id(child, skb)
We have many sk_mark_napi_id() callers, so I am providing
a new helper, forcing the setting sk_rx_queue_mapping
and sk_napi_id.
Note that we had no KMSAN report for sk_napi_id because
passive connections got a copy of this field from the listener.
sk_rx_queue_mapping in the other hand is inside the
sk_dontcopy_begin/sk_dontcopy_end so sk_clone_lock()
leaves this field uninitialized.
We might remove dead code populating req->sk_rx_queue_mapping
in the future.
Fixes: c967b40e84ad ("net: avoid dirtying sk->sk_rx_queue_mapping") Fixes: 0c1e49dfe380 ("net: avoid uninit-value from tcp_conn_request") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Tested-by: Alexander Potapenko <glider@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 2 Dec 2021 22:42:18 +0000 (14:42 -0800)]
inet: use #ifdef CONFIG_SOCK_RX_QUEUE_MAPPING consistently
Since commit 1c88d6f9caf3 ("net/sock: Add kernel config
SOCK_RX_QUEUE_MAPPING"),
sk_rx_queue_mapping access is guarded by CONFIG_SOCK_RX_QUEUE_MAPPING.
Fixes: ac19aa917b24 ("tcp: Migrate TCP_ESTABLISHED/TCP_SYN_RECV sockets in accept queues.") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Tariq Toukan <tariqt@nvidia.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Li Zhijian [Fri, 3 Dec 2021 02:53:22 +0000 (10:53 +0800)]
selftests/tc-testing: add missing config
qdiscs/fq_pie requires CONFIG_NET_SCH_FQ_PIE, otherwise tc will fail
to create a fq_pie qdisc.
It fixes following issue:
# not ok 57 83be - Create FQ-PIE with invalid number of flows
# Command exited with 2, expected 0
# Error: Specified qdisc not found.
Signed-off-by: Li Zhijian <zhijianx.li@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Li Zhijian [Fri, 3 Dec 2021 02:53:21 +0000 (10:53 +0800)]
selftests/tc-testing: add exit code
Mark the summary result as FAIL to prevent from confusing the selftest
framework if some of them are failed.
Previously, the selftest framework always treats it as *ok* even though
some of them are failed actually. That's because the script tdc.sh always
return 0.
# All test results:
#
# 1..97
# ok 1 83be - Create FQ-PIE with invalid number of flows
# ok 2 8b6e - Create RED with no flags
[...snip]
# ok 6 5f15 - Create RED with flags ECN, harddrop
# ok 7 53e8 - Create RED with flags ECN, nodrop
# ok 8 d091 - Fail to create RED with only nodrop flag
# ok 9 af8e - Create RED with flags ECN, nodrop, harddrop
# not ok 10 ce7d - Add mq Qdisc to multi-queue device (4 queues)
# Could not match regex pattern. Verify command output:
# qdisc mq 1: root
# qdisc fq_codel 0: parent 1:4 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
# qdisc fq_codel 0: parent 1:3 limit 10240p flows 1024 quantum 1514 target 5ms interval 100ms memory_limit 32Mb ecn drop_batch 64
[...snip]
# ok 96 6979 - Change quantum of a strict ETS band
# ok 97 9a7d - Change ETS strict band without quantum
#
#
#
#
ok 1 selftests: tc-testing: tdc.sh <<< summary result
CC: Philip Li <philip.li@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Li Zhijian <zhijianx.li@intel.com> Acked-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Peilin Ye [Wed, 1 Dec 2021 00:47:20 +0000 (16:47 -0800)]
selftests/fib_tests: Rework fib_rp_filter_test()
Currently rp_filter tests in fib_tests.sh:fib_rp_filter_test() are
failing. ping sockets are bound to dummy1 using the "-I" option
(SO_BINDTODEVICE), but socket lookup is failing when receiving ping
replies, since the routing table thinks they belong to dummy0.
For example, suppose ping is using a SOCK_RAW socket for ICMP messages.
When receiving ping replies, in __raw_v4_lookup(), sk->sk_bound_dev_if
is 3 (dummy1), but dif (skb_rtable(skb)->rt_iif) says 2 (dummy0), so the
raw_sk_bound_dev_eq() check fails. Similar things happen in
ping_lookup() for SOCK_DGRAM sockets.
These tests used to pass due to a bug [1] in iputils, where "ping -I"
actually did not bind ICMP message sockets to device. The bug has been
fixed by iputils commit f455fee41c07 ("ping: also bind the ICMP socket
to the specific device") in 2016, which is why our rp_filter tests
started to fail. See [2] .
Fixing the tests while keeping everything in one netns turns out to be
nontrivial. Rework the tests and build the following topology:
Consider sending an ICMP_ECHO packet A in ns2. Both source and
destination IP addresses are 192.0.2.1, and we use strict mode rp_filter
in both ns1 and ns2:
1. A is routed to lo since its destination IP address is one of ns2's
local addresses (veth2);
2. A is redirected from lo's egress to veth2's egress using mirred;
3. A arrives at veth1's ingress in ns1;
4. A is redirected from veth1's ingress to lo's ingress, again, using
mirred;
5. In __fib_validate_source(), fib_info_nh_uses_dev() returns false,
since A was received on lo, but reverse path lookup says veth1;
6. However A is not dropped since we have relaxed this check for lo in
commit 3f99d940b90b ("fib: relax source validation check for loopback
packets");
Making sure A is not dropped here in this corner case is the whole point
of having this test.
7. As A reaches the ICMP layer, an ICMP_ECHOREPLY packet, B, is
generated;
8. Similarly, B is redirected from lo's egress to veth1's egress (in
ns1), then redirected once again from veth2's ingress to lo's
ingress (in ns2), using mirred.
Also test "ping 127.0.0.1" from ns2. It does not trigger the relaxed
check in __fib_validate_source(), but just to make sure the topology
works with loopback addresses.
Reported-by: Hangbin Liu <liuhangbin@gmail.com> Fixes: f3beae7cdf9a ("selftests: add a test case for rp_filter") Reviewed-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Acked-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211201004720.6357-1-yepeilin.cs@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- rds: correct socket tunable error in rds_tcp_tune()
- ipv6: fix memory leak in fib6_rule_suppress
- wireguard: reset peer src endpoint when netns exits
- wireguard: improve resilience to DoS around incoming handshakes
- tcp: fix page frag corruption on page fault which involves TCP
- mpls: fix missing attributes in delete notifications
- mt7915: fix NULL pointer dereference with ad-hoc mode
Misc:
- rt2x00: be more lenient about EPROTO errors during start
- mlx4_en: update reported link modes for 1/10G"
* tag 'net-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (85 commits)
net: dsa: b53: Add SPI ID table
gro: Fix inconsistent indenting
selftests: net: Correct case name
net/rds: correct socket tunable error in rds_tcp_tune()
mctp: Don't let RTM_DELROUTE delete local routes
net/smc: Keep smc_close_final rc during active close
ibmvnic: drop bad optimization in reuse_tx_pools()
ibmvnic: drop bad optimization in reuse_rx_pools()
net/smc: fix wrong list_del in smc_lgr_cleanup_early
Fix Comment of ETH_P_802_3_MIN
ethernet: aquantia: Try MAC address from device tree
ipv4: convert fib_num_tclassid_users to atomic_t
net: avoid uninit-value from tcp_conn_request
net: annotate data-races on txq->xmit_lock_owner
octeontx2-af: Fix a memleak bug in rvu_mbox_init()
net/mlx4_en: Fix an use-after-free bug in mlx4_en_try_alloc_resources()
vrf: Reset IPCB/IP6CB when processing outbound pkts in vrf dev xmit
net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings()
net: dsa: mv88e6xxx: Link in pcs_get_state() if AN is bypassed
net: dsa: mv88e6xxx: Fix inband AN for 2500base-x on 88E6393X family
...
Linus Torvalds [Thu, 2 Dec 2021 19:07:41 +0000 (11:07 -0800)]
Merge tag 'trace-v5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Three tracing fixes:
- Allow compares of strings when using signed and unsigned characters
- Fix kmemleak false positive for histogram entries
- Handle negative numbers for user defined kretprobe data sizes"
* tag 'trace-v5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
kprobes: Limit max data_size of the kretprobe instances
tracing: Fix a kmemleak false positive in tracing_map
tracing/histograms: String compares should not care about signed values
Linus Torvalds [Thu, 2 Dec 2021 18:56:16 +0000 (10:56 -0800)]
Merge tag 'for-linus-5.16-2' of git://github.com/cminyard/linux-ipmi
Pull IPMI fixes from Corey Minyard:
"Some changes that went in 5.16 had issues. When working on the design
a piece was redesigned and things got missed. And the message type was
not being initialized when it was allocated, resulting in crashes.
In addition, the IPMI driver has had a shutdown issue where it could
still have an item in a system workqueue after it had been shutdown.
Move to a private workqueue to avoid that problem"
* tag 'for-linus-5.16-2' of git://github.com/cminyard/linux-ipmi:
ipmi:ipmb: Fix unknown command response
ipmi: fix IPMI_SMI_MSG_TYPE_IPMB_DIRECT response length checking
ipmi: fix oob access due to uninit smi_msg type
ipmi: msghandler: Make symbol 'remove_work_wq' static
ipmi: Move remove_work to dedicated workqueue
Currently autoloading for SPI devices does not use the DT ID table, it
uses SPI modalises. Supporting OF modalises is going to be difficult if
not impractical, an attempt was made but has been reverted, so ensure
that module autoloading works for this driver by adding an id_table
listing the SPI IDs for everything.
Fixes: 0ce7936ceffc ("spi: Revert modalias changes") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Li Zhijian [Thu, 2 Dec 2021 02:28:41 +0000 (10:28 +0800)]
selftests: net: Correct case name
ipv6_addr_bind/ipv4_addr_bind are function names. Previously, bind test
would not be run by default due to the wrong case names
Fixes: b249bfc2296e ("selftests: Add ipv6 address bind tests to fcnal-test") Fixes: 5d4a6c65516f ("selftests: Add ipv4 address bind tests to fcnal-test") Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/rds: correct socket tunable error in rds_tcp_tune()
Correct an error where setting /proc/sys/net/rds/tcp/rds_tcp_rcvbuf would
instead modify the socket's sk_sndbuf and would leave sk_rcvbuf untouched.
Fixes: 260a1efff476 ("RDS: TCP: Add sysctl tunables for sndbuf/rcvbuf on rds-tcp socket") Signed-off-by: William Kucharski <william.kucharski@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Johnston [Wed, 1 Dec 2021 08:07:42 +0000 (16:07 +0800)]
mctp: Don't let RTM_DELROUTE delete local routes
We need to test against the existing route type, not
the rtm_type in the netlink request.
Fixes: 200c1b0d4752 ("mctp: Specify route types, require rtm_type in RTM_*ROUTE messages") Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Lu [Wed, 1 Dec 2021 06:42:16 +0000 (14:42 +0800)]
net/smc: Keep smc_close_final rc during active close
When smc_close_final() returns error, the return code overwrites by
kernel_sock_shutdown() in smc_close_active(). The return code of
smc_close_final() is more important than kernel_sock_shutdown(), and it
will pass to userspace directly.
Fix it by keeping both return codes, if smc_close_final() raises an
error, return it or kernel_sock_shutdown()'s.
Link: https://lore.kernel.org/linux-s390/1f67548e-cbf6-0dce-82b5-10288a4583bd@linux.ibm.com/ Fixes: 5b5d5453db58 ("net/smc: Ensure the active closing peer first closes clcsock") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ibmvnic: drop bad optimization in reuse_tx_pools()
When trying to decide whether or not reuse existing rx/tx pools
we tried to allow a range of values for the pool parameters rather
than exact matches. This was intended to reuse the resources for
instance when switching between two VIO servers with different
default parameters.
But this optimization is incomplete and breaks when we try to
change the number of queues for instance. The optimization needs
to be updated, so drop it for now and simplify the code.
Fixes: 7516c9974ece ("ibmvnic: Reuse tx pools when possible") Reported-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ibmvnic: drop bad optimization in reuse_rx_pools()
When trying to decide whether or not reuse existing rx/tx pools
we tried to allow a range of values for the pool parameters rather
than exact matches. This was intended to reuse the resources for
instance when switching between two VIO servers with different
default parameters.
But this optimization is incomplete and breaks when we try to
change the number of queues for instance. The optimization needs
to be updated, so drop it for now and simplify the code.
Fixes: 48182133bac4 ("ibmvnic: Reuse rx pools when possible") Reported-by: Dany Madden <drt@linux.ibm.com> Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.ibm.com> Reviewed-by: Dany Madden <drt@linux.ibm.com> Reviewed-by: Rick Lindsley <ricklind@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dust Li [Wed, 1 Dec 2021 03:02:30 +0000 (11:02 +0800)]
net/smc: fix wrong list_del in smc_lgr_cleanup_early
smc_lgr_cleanup_early() meant to delete the link
group from the link group list, but it deleted
the list head by mistake.
This may cause memory corruption since we didn't
remove the real link group from the list and later
memseted the link group structure.
We got a list corruption panic when testing:
Fixes: bb81994a1043a ("net/smc: separate locks for SMCD and SMCR link group lists") Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xiayu Zhang [Wed, 1 Dec 2021 02:57:13 +0000 (10:57 +0800)]
Fix Comment of ETH_P_802_3_MIN
The description of ETH_P_802_3_MIN is misleading.
The value of EthernetType in Ethernet II frame is more than 0x0600,
the value of Length in 802.3 frame is less than 0x0600.
Signed-off-by: Xiayu Zhang <Xiayu.Zhang@mediatek.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tianhao Chai [Wed, 1 Dec 2021 02:57:06 +0000 (20:57 -0600)]
ethernet: aquantia: Try MAC address from device tree
Apple M1 Mac minis (2020) with 10GE NICs do not have MAC address in the
card, but instead need to obtain MAC addresses from the device tree. In
this case the hardware will report an invalid MAC.
Currently atlantic driver does not query the DT for MAC address and will
randomly assign a MAC if the NIC doesn't have a permanent MAC burnt in.
This patch causes the driver to perfer a valid MAC address from OF (if
present) over HW self-reported MAC and only fall back to a random MAC
address when neither of them is valid.
Signed-off-by: Tianhao Chai <cth451@gmail.com> Reviewed-by: Igor Russkikh <irusskikh@marvell.com> Reviewed-by: Hector Martin <marcan@marcan.st> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 2 Dec 2021 02:26:35 +0000 (18:26 -0800)]
ipv4: convert fib_num_tclassid_users to atomic_t
Before commit fffe6e0e5011 ("ipv4: Create cleanup helper for fib_nh")
changes to net->ipv4.fib_num_tclassid_users were protected by RTNL.
After the change, this is no longer the case, as free_fib_info_rcu()
runs after rcu grace period, without rtnl being held.
Fixes: fffe6e0e5011 ("ipv4: Create cleanup helper for fib_nh") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Ahern <dsahern@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 28712 Comm: syz-executor.0 Tainted: G W 5.16.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Zhou Qingyang [Tue, 30 Nov 2021 16:50:39 +0000 (00:50 +0800)]
octeontx2-af: Fix a memleak bug in rvu_mbox_init()
In rvu_mbox_init(), mbox_regions is not freed or passed out
under the switch-default region, which could lead to a memory leak.
Fix this bug by changing 'return err' to 'goto free_regions'.
This bug was found by a static analyzer. The analysis employs
differential checking to identify inconsistent security operations
(e.g., checks or kfrees) between two code paths and confirms that the
inconsistent operations are not recovered in the current function or
the callers, so they constitute bugs.
Note that, as a bug found by static analysis, it can be a false
positive or hard to trigger. Multiple researchers have cross-reviewed
the bug.
Builds with CONFIG_OCTEONTX2_AF=y show no new warnings,
and our static analyzer no longer warns about this code.
Zhou Qingyang [Tue, 30 Nov 2021 16:44:38 +0000 (00:44 +0800)]
net/mlx4_en: Fix an use-after-free bug in mlx4_en_try_alloc_resources()
In mlx4_en_try_alloc_resources(), mlx4_en_copy_priv() is called and
tmp->tx_cq will be freed on the error path of mlx4_en_copy_priv().
After that mlx4_en_alloc_resources() is called and there is a dereference
of &tmp->tx_cq[t][i] in mlx4_en_alloc_resources(), which could lead to
a use after free problem on failure of mlx4_en_copy_priv().
Fix this bug by adding a check of mlx4_en_copy_priv()
This bug was found by a static analyzer. The analysis employs
differential checking to identify inconsistent security operations
(e.g., checks or kfrees) between two code paths and confirms that the
inconsistent operations are not recovered in the current function or
the callers, so they constitute bugs.
Note that, as a bug found by static analysis, it can be a false
positive or hard to trigger. Multiple researchers have cross-reviewed
the bug.
Builds with CONFIG_MLX4_EN=m show no new warnings,
and our static analyzer no longer warns about this code.
Fixes: 9d0750234dbf ("net/mlx4_en: Add resilience in low memory systems") Signed-off-by: Zhou Qingyang <zhou1615@umn.edu> Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Link: https://lore.kernel.org/r/20211130164438.190591-1-zhou1615@umn.edu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
vrf: Reset IPCB/IP6CB when processing outbound pkts in vrf dev xmit
IPCB/IP6CB need to be initialized when processing outbound v4 or v6 pkts
in the codepath of vrf device xmit function so that leftover garbage
doesn't cause futher code that uses the CB to incorrectly process the
pkt.
One occasion of the issue might occur when MPLS route uses the vrf
device as the outgoing device such as when the route is added using "ip
-f mpls route add <label> dev <vrf>" command.
The problems seems to exist since day one. Hence I put the day one
commits on the Fixes tags.
Fixes: 9c5f942f99b3 ("net: Introduce VRF device driver") Fixes: d338be79af90 ("net: Add IPv6 support to VRF device") Cc: stable@vger.kernel.org Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20211130162637.3249-1-ssuryaextr@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Zhou Qingyang [Tue, 30 Nov 2021 11:08:48 +0000 (19:08 +0800)]
net: qlogic: qlcnic: Fix a NULL pointer dereference in qlcnic_83xx_add_rings()
In qlcnic_83xx_add_rings(), the indirect function of
ahw->hw_ops->alloc_mbx_args will be called to allocate memory for
cmd.req.arg, and there is a dereference of it in qlcnic_83xx_add_rings(),
which could lead to a NULL pointer dereference on failure of the
indirect function like qlcnic_83xx_alloc_mbx_args().
Fix this bug by adding a check of alloc_mbx_args(), this patch
imitates the logic of mbx_cmd()'s failure handling.
This bug was found by a static analyzer. The analysis employs
differential checking to identify inconsistent security operations
(e.g., checks or kfrees) between two code paths and confirms that the
inconsistent operations are not recovered in the current function or
the callers, so they constitute bugs.
Note that, as a bug found by static analysis, it can be a false
positive or hard to trigger. Multiple researchers have cross-reviewed
the bug.
Builds with CONFIG_QLCNIC=m show no new warnings, and our
static analyzer no longer warns about this code.
kprobes: Limit max data_size of the kretprobe instances
The 'kprobe::data_size' is unsigned, thus it can not be negative. But if
user sets it enough big number (e.g. (size_t)-8), the result of 'data_size
+ sizeof(struct kretprobe_instance)' becomes smaller than sizeof(struct
kretprobe_instance) or zero. In result, the kretprobe_instance are
allocated without enough memory, and kretprobe accesses outside of
allocated memory.
To avoid this issue, introduce a max limitation of the
kretprobe::data_size. 4KB per instance should be OK.
The reason is elts->pages[i] is alloced by get_zeroed_page.
and kmemleak will not scan the area alloced by get_zeroed_page.
The address stored in elts->pages will be regarded as leaked.
That is, the elts->pages[i] will have pointers loaded onto it as well, and
without telling kmemleak about it, those pointers will look like memory
without a reference.
To fix this, call kmemleak_alloc to tell kmemleak to scan elts->pages[i]
tracing/histograms: String compares should not care about signed values
When comparing two strings for the "onmatch" histogram trigger, fields
that are strings use string comparisons, which do not care about being
signed or not.
Do not fail to match two string fields if one is unsigned char array and
the other is a signed char array.
Link: https://lore.kernel.org/all/20211129123043.5cfd687a@gandalf.local.home/ Cc: stable@vgerk.kernel.org Cc: Tom Zanussi <zanussi@kernel.org> Cc: Yafang Shao <laoar.shao@gmail.com> Fixes: beb9492445390 ("tracing: Accept different type for synthetic event fields") Reviewed-by: Masami Hiramatsu <mhiramatsu@kernel.org> Reported-by: Sven Schnelle <svens@linux.ibm.com> Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Linus Torvalds [Wed, 1 Dec 2021 18:07:39 +0000 (10:07 -0800)]
Merge tag 'sound-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of small fixes. A large series is found for ASoC tegra
drivers to correct the control element handlings, while others are
mostly for device-specific quirks and fix-ups"
* tag 'sound-5.16-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (25 commits)
ALSA: hda/hdmi: fix HDA codec entry table order for ADL-P
ALSA: hda: Add Intel DG2 PCI ID and HDMI codec vid
ALSA: hda/cs8409: Set PMSG_ON earlier inside cs8409 driver
ASoC: SOF: hda: reset DAI widget before reconfiguring it
ASoC: cs35l41: Set the max SPI speed for the whole device
ALSA: intel-dsp-config: add quirk for CML devices based on ES8336 codec
ASoC: Intel: soc-acpi: add entry for ESSX8336 on CML
ASoC: rk817: Add module alias for rk817-codec
ASoC: soc-acpi: Set mach->id field on comp_ids matches
ASoC: tegra: Fix kcontrol put callback in Mixer
ASoC: tegra: Fix kcontrol put callback in ADX
ASoC: tegra: Fix kcontrol put callback in AMX
ASoC: tegra: Fix kcontrol put callback in SFC
ASoC: tegra: Fix kcontrol put callback in MVC
ASoC: tegra: Fix kcontrol put callback in AHUB
ASoC: tegra: Fix kcontrol put callback in DSPK
ASoC: tegra: Fix kcontrol put callback in DMIC
ASoC: tegra: Fix kcontrol put callback in I2S
ASoC: tegra: Fix kcontrol put callback in ADMAIF
ASoC: tegra: Fix wrong value type in MVC
...
So I managed to discovered how to fix inband AN for 2500base-x mode on
88E6393x (Amethyst) family.
This series fixes application of erratum 4.8, adds fix for erratum 5.2,
adds support for completely disablign SerDes receiver / transmitter,
fixes inband AN for 2500base-x mode by using 1000base-x mode and simply
changing frequeny to 3.125 GHz, all this for 88E6393X.
The last commit fixes linking when link partner has AN disabled and the
device invokes the AN bypass feature. Currently we fail to link in this
case.
Changes since v1:
- fixed wrong operator in patch 3 (thanks Russell)
- added more comments about why BMCR_ANENABLE is used in patch 6 (thanks
Russell)
- updated some return statements from
if (something)
return func();
return 0;
to
if (something)
err = func();
return err;
(err is set to 0 before the condition)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Behún [Tue, 30 Nov 2021 17:01:51 +0000 (18:01 +0100)]
net: dsa: mv88e6xxx: Link in pcs_get_state() if AN is bypassed
Function mv88e6xxx_serdes_pcs_get_state() currently does not report link
up if AN is enabled, Link bit is set, but Speed and Duplex Resolved bit
is not set, which testing shows is the case for when auto-negotiation
was bypassed (we have AN enabled but link partner does not).
An example of such link partner is Marvell 88X3310 PHY, when put into
the mode where host interface changes between 10gbase-r, 5gbase-r,
2500base-x and sgmii according to copper speed. The 88X3310 does not
enable AN in 2500base-x, and so SerDes on mv88e6xxx currently does not
link with it.
Fix this.
Fixes: 860d703ec5b2 ("net: dsa: mv88e6xxx: extend phylink to Serdes PHYs") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Behún [Tue, 30 Nov 2021 17:01:50 +0000 (18:01 +0100)]
net: dsa: mv88e6xxx: Fix inband AN for 2500base-x on 88E6393X family
Inband AN is broken on Amethyst in 2500base-x mode when set by standard
mechanism (via cmode).
(There probably is some weird setting done by default in the switch for
this mode that make it cycle in some state or something, because when
the peer is the mvneta controller, it receives link change interrupts
every ~0.3ms, but the link is always down.)
Get around this by configuring the PCS mode to 1000base-x (where inband
AN works), and then changing the SerDes frequency while SerDes
transmitter and receiver are disabled, before enabling SerDes PHY. After
disabling SerDes PHY, change the PCS mode back to 2500base-x, to avoid
confusing the device (if we leave it at 1000base-x PCS mode but with
different frequency, and then change cmode to sgmii, the device won't
change the frequency because it thinks it already has the correct one).
The register which changes the frequency is undocumented. I discovered
it by going through all registers in the ranges 4.f000-4.f100 and
1e.8000-1e.8200 for all SerDes cmodes (sgmii, 1000base-x, 2500base-x,
5gbase-r, 10gbase-r, usxgmii) and filtering out registers that didn't
make sense (the value was the same for modes which have different
frequency). The result of this was:
Register 04.f002 is the documented Port Operational Confiuration
register, it's last 3 bits select PCS type, so changing this register
also changes the frequency to the appropriate value.
Registers 04.f076 and 04.f07c are not writable.
Undocumented register 1e.8000 was the one: changing bits 3:0 from 9 to 8
changed SerDes frequency to 3.125 GHz, while leaving the value of PCS
mode in register 04.f002.2:0 at 1000base-x. Inband autonegotiation
started working correctly.
(I didn't try anything with register 1e.8140 since 1e.8000 solved the
problem.)
Since I don't have documentation for this register 1e.8000.3:0, I am
using the constants without names, but my hypothesis is that this
register selects PHY frequency. If in the future I have access to an
oscilloscope able to handle these frequencies, I will try to test this
hypothesis.
Fixes: 5fe9fbda56e8 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Behún [Tue, 30 Nov 2021 17:01:49 +0000 (18:01 +0100)]
net: dsa: mv88e6xxx: Add fix for erratum 5.2 of 88E6393X family
Add fix for erratum 5.2 of the 88E6393X (Amethyst) family: for 10gbase-r
mode, some undocumented registers need to be written some special
values.
Fixes: 5fe9fbda56e8 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Behún [Tue, 30 Nov 2021 17:01:48 +0000 (18:01 +0100)]
net: dsa: mv88e6xxx: Save power by disabling SerDes trasmitter and receiver
Save power on 88E6393X by disabling SerDes receiver and transmitter
after SerDes is SerDes is disabled.
Signed-off-by: Marek Behún <kabel@kernel.org> Cc: stable@vger.kernel.org # 5fe9fbda56e8 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: David S. Miller <davem@davemloft.net>
Marek Behún [Tue, 30 Nov 2021 17:01:46 +0000 (18:01 +0100)]
net: dsa: mv88e6xxx: Fix application of erratum 4.8 for 88E6393X
According to SERDES scripts for 88E6393X, erratum 4.8 has to be applied
every time before SerDes is powered on.
Split the code for erratum 4.8 into separate function and call it in
mv88e6393x_serdes_power().
Fixes: 5fe9fbda56e8 ("net: dsa: mv88e6xxx: add support for mv88e6393x family") Signed-off-by: Marek Behún <kabel@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Ben Ben-Ishay [Sun, 31 Oct 2021 16:31:02 +0000 (18:31 +0200)]
net/mlx5e: SHAMPO, Fix constant expression result
mlx5e_build_shampo_hd_umr uses counters i and index incorrectly
as unsigned, thus the err state err_unmap could stuck in endless loop.
Change i to int to solve the first issue.
Reduce index check to solve the second issue, the caller function
validates that index could not rotate.
Fixes: 7de3d36bcd47 ("net/mlx5e: Add data path for SHAMPO feature") Signed-off-by: Ben Ben-Ishay <benishay@nvidia.com> Reviewed-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Amir Tzin [Wed, 20 Oct 2021 09:45:05 +0000 (12:45 +0300)]
net/mlx5: Fix use after free in mlx5_health_wait_pci_up
The device health recovery flow calls mlx5_health_wait_pci_up() which
queries the device for FW_RESET timeout after freeing the device
timeouts structure on mlx5_function_teardown(). Fix this bug by moving
timeouts structure init/cleanup to the device's init/uninit phases.
Since it is necessary to reset default software timeouts on function
reload, extract setting of defaults values from mlx5_tout_init() and
call mlx5_tout_set_def_val() directly from mlx5_function_setup().
Maor Dickman [Tue, 23 Nov 2021 12:37:11 +0000 (14:37 +0200)]
net/mlx5: E-Switch, Use indirect table only if all destinations support it
When adding rule with multiple destinations, indirect table is used for all of
the destinations if at least one of the destinations support it, this can cause
creation of invalid indirect tables for the destinations that doesn't support it.
Fixed it by using indirect table only if all destinations support it.
Mark Bloch [Thu, 21 Oct 2021 12:46:17 +0000 (12:46 +0000)]
net/mlx5: E-Switch, fix single FDB creation on BlueField
Always use MLX5_FLOW_TABLE_OTHER_VPORT flag when creating egress ACL
table for single FDB. Not doing so on BlueField will make firmware fail
the command. On BlueField the E-Switch manager is the ECPF (vport 0xFFFE)
which is filled in the flow table creation command but as the
other_vport field wasn't set the firmware complains about a bad parameter.
This is different from a regular HCA where the E-Switch manager vport is
the PF (vport 0x0). Passing MLX5_FLOW_TABLE_OTHER_VPORT will make the
firmware happy both on BlueField and on regular HCAs without special
condition for each.
This fixes the bellow firmware syndrome:
mlx5_cmd_check:819:(pid 571): CREATE_FLOW_TABLE(0x930) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x754a4)
Fixes: 55b0701c0551 ("net/mlx5: E-Switch, add logic to enable shared FDB") Signed-off-by: Mark Bloch <mbloch@nvidia.com> Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Dmytro Linkin [Tue, 21 Sep 2021 12:47:33 +0000 (15:47 +0300)]
net/mlx5: E-switch, Respect BW share of the new group
To enable transmit schduler on vport FW require non-zero configuration
for vport's TSAR. If vport added to the group which has configured BW
share value and TX rate values of the vport are zero, then scheduler
wouldn't be enabled on this vport.
Fix that by calling BW normalization if BW share of the new group is
configured.
Fixes: f6b97bcb1b0d ("net/mlx5: E-switch, Allow to add vports to rate groups") Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Reviewed-by: Parav Pandit <parav@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Maor Gottlieb [Thu, 18 Nov 2021 10:29:15 +0000 (12:29 +0200)]
net/mlx5: Lag, Fix recreation of VF LAG
Driver needs to nullify the port select attributes of the LAG when
port selection is destroyed, otherwise it breaks recreation of the
LAG.
It fixes the below kernel oops:
Fixes: a5b3c1ca5f2f ("net/mlx5: Lag, add support to create/destroy/modify port selection") Signed-off-by: Maor Gottlieb <maorg@nvidia.com> Reviewed-by: Mark Bloch <mbloch@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
net/mlx5e: Sync TIR params updates against concurrent create/modify
Transport Interface Receive (TIR) objects perform the packet processing and
reassembly and is also responsible for demultiplexing the packets into the
different RQs.
There are certain TIR context attributes that propagate to the pointed RQs
and applied to them (like packet_merge offloads (LRO/SHAMPO) and
tunneled_offload_en). When TIRs do not agree on attributes values, a "last
one wins" policy is applied. Hence, if not synced properly, a race between
TIR params update and a concurrent TIR create/modify operation might yield
to a mismatch between the shadow parameters in SW and the actual applied
state of the RQs in HW.
tunneled_offload_en is a fixed attribute per profile, while packet merge
offload state might be toggled and get out-of-sync. When this happens,
packet_merge offload might be working although not requested, or the
opposite.
All updates to packet_merge state and all create/modify operations of
regular redirection/steering TIRs are done under the same priv->state_lock,
so they do not run in parallel, and no race is possible.
However, there are other kind of TIRs (acceleration offloads TIRs, like TLS
TIRs) which are created on demand for each new connection without holding
the coarse priv->state_lock, hence might race.
Fix this by synchronizing all packet_merge state reads and writes against
all TIR create/modify operations. Include the modify operations of the
regular redirection steering TIRs under the new lock, for better code
layering and division of responsibilities.
Raed Salem [Thu, 8 Jul 2021 09:48:24 +0000 (12:48 +0300)]
net/mlx5e: Fix missing IPsec statistics on uplink representor
The cited patch added the IPsec support to uplink representor, however
as uplink representors have his private statistics where IPsec stats
is not part of it, that effectively makes IPsec stats hidden when uplink
representor stats queried.
Resolve by adding IPsec stats to uplink representor private statistics.
Fixes: b9faaa0f5b6c ("net/mlx5e: Add IPsec support to uplink representor") Signed-off-by: Raed Salem <raeds@nvidia.com> Reviewed-by: Alaa Hleihel <alaa@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Raed Salem [Wed, 17 Nov 2021 11:33:57 +0000 (13:33 +0200)]
net/mlx5e: IPsec: Fix Software parser inner l3 type setting in case of encapsulation
Current code wrongly uses the skb->protocol field which reflects the
outer l3 protocol to set the inner l3 type in Software Parser (SWP)
fields settings in the ethernet segment (eseg) in flows where inner
l3 exists like in Vxlan over ESP flow, the above method wrongly use
the outer protocol type instead of the inner one. thus breaking cases
where inner and outer headers have different protocols.
Fix by setting the inner l3 type in SWP according to the inner l3 ip
header version.
Randy Dunlap [Tue, 30 Nov 2021 06:39:47 +0000 (22:39 -0800)]
natsemi: xtensa: fix section mismatch warnings
Fix section mismatch warnings in xtsonic. The first one appears to be
bogus and after fixing the second one, the first one is gone.
WARNING: modpost: vmlinux.o(.text+0x529adc): Section mismatch in reference from the function sonic_get_stats() to the function .init.text:set_reset_devices()
The function sonic_get_stats() references
the function __init set_reset_devices().
This is often because sonic_get_stats lacks a __init
annotation or the annotation of set_reset_devices is wrong.
WARNING: modpost: vmlinux.o(.text+0x529b3b): Section mismatch in reference from the function xtsonic_probe() to the function .init.text:sonic_probe1()
The function xtsonic_probe() references
the function __init sonic_probe1().
This is often because xtsonic_probe lacks a __init
annotation or the annotation of sonic_probe1 is wrong.
Fixes: 01b799412c82 ("xtensa: Add support for the Sonic Ethernet device for the XT2000 board.") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Cc: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Cc: Finn Thain <fthain@telegraphics.com.au> Cc: Chris Zankel <chris@zankel.net> Cc: linux-xtensa@linux-xtensa.org Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Acked-by: Max Filippov <jcmvbkbc@gmail.com> Link: https://lore.kernel.org/r/20211130063947.7529-1-rdunlap@infradead.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: netlink: af_netlink: Prevent empty skb by adding a check on len.
Adding a check on len parameter to avoid empty skb. This prevents a
division error in netem_enqueue function which is caused when skb->len=0
and skb->data_len=0 in the randomized corruption step as shown below.
random.c is a bit understaffed, and folks want more prompt reviews. I've
got the crypto background and the interest to do these reviews, and have
authored parts of the file already.
Cc: Theodore Ts'o <tytso@mit.edu> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Tue, 30 Nov 2021 17:22:15 +0000 (09:22 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"ARM64:
- Fix constant sign extension affecting TCR_EL2 and preventing
running on ARMv8.7 models due to spurious bits being set
- Fix use of helpers using PSTATE early on exit by always sampling it
as soon as the exit takes place
- Move pkvm's 32bit handling into a common helper
RISC-V:
- Fix incorrect KVM_MAX_VCPUS value
- Unmap stage2 mapping when deleting/moving a memslot
x86:
- Fix and downgrade BUG_ON due to uninitialized cache
- Many APICv and MOVE_ENC_CONTEXT_FROM fixes
- Correctly emulate TLB flushes around nested vmentry/vmexit and when
the nested hypervisor uses VPID
- Prevent modifications to CPUID after the VM has run
- Other smaller bugfixes
Generic:
- Memslot handling bugfixes"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (44 commits)
KVM: fix avic_set_running for preemptable kernels
KVM: VMX: clear vmx_x86_ops.sync_pir_to_irr if APICv is disabled
KVM: SEV: accept signals in sev_lock_two_vms
KVM: SEV: do not take kvm->lock when destroying
KVM: SEV: Prohibit migration of a VM that has mirrors
KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
KVM: SEV: initialize regions_list of a mirror VM
KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
KVM: SEV: do not use list_replace_init on an empty list
KVM: x86: Use a stable condition around all VT-d PI paths
KVM: x86: check PIR even for vCPUs with disabled APICv
KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem
KVM: x86/mmu: Handle "default" period when selectively waking kthread
KVM: MMU: shadow nested paging does not have PKU
KVM: x86/mmu: Remove spurious TLB flushes in TDP MMU zap collapsible path
KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
KVM: X86: Use vcpu->arch.walk_mmu for kvm_mmu_invlpg()
...
Commit a6fa1762fed1 ("include/linux/radix-tree.h: replace kernel.h with
the necessary inclusions") broke the radix tree test suite in two
different ways; first by including math.h which didn't exist in the
tools directory, and second by removing an implicit include of
spinlock.h before lockdep.h. Fix both issues.
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Paolo Bonzini [Tue, 30 Nov 2021 08:46:07 +0000 (03:46 -0500)]
KVM: fix avic_set_running for preemptable kernels
avic_set_running() passes the current CPU to avic_vcpu_load(), albeit
via vcpu->cpu rather than smp_processor_id(). If the thread is migrated
while avic_set_running runs, the call to avic_vcpu_load() can use a stale
value for the processor id. Avoid this by blocking preemption over the
entire execution of avic_set_running().
Reported-by: Sean Christopherson <seanjc@google.com> Fixes: af9badde83f5 ("svm: Manage vcpu load/unload when enable AVIC") Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Karsten Graul [Tue, 30 Nov 2021 07:33:58 +0000 (08:33 +0100)]
MAINTAINERS: s390/net: add Alexandra and Wenjia as maintainer
Add Alexandra and Wenjia as maintainers for drivers/s390/net and iucv.
Also, remove myself as maintainer for these areas.
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Acked-by: Alexandra Winter <wintera@linux.ibm.com> Acked-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dongliang Mu [Tue, 30 Nov 2021 04:05:54 +0000 (12:05 +0800)]
dpaa2-eth: destroy workqueue at the end of remove function
The commit 9289bbe02a82 ("dpaa2-eth: support PTP Sync packet one-step
timestamping") forgets to destroy workqueue at the end of remove
function.
Fix this by adding destroy_workqueue before fsl_mc_portal_free and
free_netdev.
Fixes: 9289bbe02a82 ("dpaa2-eth: support PTP Sync packet one-step timestamping") Signed-off-by: Dongliang Mu <mudongliangabcd@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ice: xsk: clear status_error0 for each allocated desc
Fix a bug in which the receiving of packets can stop in the zero-copy
driver. Ice HW ignores 3 lower bits from QRX_TAIL register, which means
that tail is bumped only on intervals of 8. Currently with XSK RX
batching in place, ice_alloc_rx_bufs_zc() clears the status_error0 only
of the last descriptor that has been allocated/taken from the XSK buffer
pool. status_error0 includes DD bit that is looked upon by the
ice_clean_rx_irq_zc() to tell if a descriptor can be processed.
The bug can be triggered when driver updates the ntu but not the
QRX_TAIL, so HW wouldn't have a chance to write to the ready
descriptors. Later on driver moves the ntc to the mentioned set of
descriptors and interprets them as a ready to be processed, since
corresponding DD bits were not cleared nor any writeback has happened
that would clear it. This can then lead to ntc == ntu case which means
that ring is empty and no further packet processing.
Fix the XSK traffic hang that can be observed when l2fwd scenario from
xdpsock is used by making sure that status_error0 is cleared for each
descriptor that is fed to HW and therefore we are sure that driver will
not processed non-valid DD bits. This will also prevent the driver from
processing the descriptors that were allocated in favor of the
previously processed ones, but writeback didn't happen yet.
Fixes: a676410fb696 ("ice: Use the xsk batched rx allocation interface") Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Alexander Lobakin <alexandr.lobakin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: marvell: mvpp2: Fix the computation of shared CPUs
'bitmap_fill()' fills a bitmap one 'long' at a time.
It is likely that an exact number of bits is expected.
Use 'bitmap_set()' instead in order not to set unexpected bits.
Fixes: bf2889aeeadc ("net: mvpp2: handle cases where more CPUs are available than s/w threads") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:36 +0000 (19:50 -0500)]
KVM: SEV: accept signals in sev_lock_two_vms
Generally, kvm->lock is not taken for a long time, but
sev_lock_two_vms is different: it takes vCPU locks
inside, so userspace can hold it back just by calling
a vCPU ioctl. Play it safe and use mutex_lock_killable.
Message-Id: <20211123005036.2954379-13-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:35 +0000 (19:50 -0500)]
KVM: SEV: do not take kvm->lock when destroying
Taking the lock is useless since there are no other references,
and there are already accesses (e.g. to sev->enc_context_owner)
that do not take it. So get rid of it.
Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-12-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:34 +0000 (19:50 -0500)]
KVM: SEV: Prohibit migration of a VM that has mirrors
VMs that mirror an encryption context rely on the owner to keep the
ASID allocated. Performing a KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
would cause a dangling ASID:
1. copy context from A to B (gets ref to A)
2. move context from A to L (moves ASID from A to L)
3. close L (releases ASID from L, B still references it)
The right way to do the handoff instead is to create a fresh mirror VM
on the destination first:
1. copy context from A to B (gets ref to A)
[later] 2. close B (releases ref to A)
3. move context from A to L (moves ASID from A to L)
4. copy context from L to M
So, catch the situation by adding a count of how many VMs are
mirroring this one's encryption context.
Fixes: 697b9f909d28 ("KVM: SEV: Add support for SEV-ES intra host migration")
Message-Id: <20211123005036.2954379-11-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:33 +0000 (19:50 -0500)]
KVM: SEV: Do COPY_ENC_CONTEXT_FROM with both VMs locked
Now that we have a facility to lock two VMs with deadlock
protection, use it for the creation of mirror VMs as well. One of
COPY_ENC_CONTEXT_FROM(dst, src) and COPY_ENC_CONTEXT_FROM(src, dst)
would always fail, so the combination is nonsensical and it is okay to
return -EBUSY if it is attempted.
This sidesteps the question of what happens if a VM is
MOVE_ENC_CONTEXT_FROM'd at the same time as it is
COPY_ENC_CONTEXT_FROM'd: the locking prevents that from
happening.
Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-10-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:32 +0000 (19:50 -0500)]
selftests: sev_migrate_tests: add tests for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM
I am putting the tests in sev_migrate_tests because the failure conditions are
very similar and some of the setup code can be reused, too.
The tests cover both successful creation of a mirror VM, and error
conditions.
Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-9-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:31 +0000 (19:50 -0500)]
KVM: SEV: move mirror status to destination of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
Allow intra-host migration of a mirror VM; the destination VM will be
a mirror of the same ASID as the source.
Fixes: 1167986334b9 ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-8-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:30 +0000 (19:50 -0500)]
KVM: SEV: initialize regions_list of a mirror VM
This was broken before the introduction of KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM,
but technically harmless because the region list was unused for a mirror
VM. However, it is untidy and it now causes a NULL pointer access when
attempting to move the encryption context of a mirror VM.
Fixes: 902f83bac915 ("KVM: x86: Support KVM VMs sharing SEV context")
Message-Id: <20211123005036.2954379-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:29 +0000 (19:50 -0500)]
KVM: SEV: cleanup locking for KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM
Encapsulate the handling of the migration_in_progress flag for both VMs in
two functions sev_lock_two_vms and sev_unlock_two_vms. It does not matter
if KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM locks the destination struct kvm a bit
later, and this change 1) keeps the cleanup chain of labels smaller 2)
makes it possible for KVM_CAP_VM_COPY_ENC_CONTEXT_FROM to reuse the logic.
Cc: Peter Gonda <pgonda@google.com> Cc: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-6-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:50:28 +0000 (19:50 -0500)]
KVM: SEV: do not use list_replace_init on an empty list
list_replace_init cannot be used if the source is an empty list,
because "new->next->prev = new" will overwrite "old->next":
new old
prev = new, next = new prev = old, next = old
new->next = old->next prev = new, next = old prev = old, next = old
new->next->prev = new prev = new, next = old prev = old, next = new
new->prev = old->prev prev = old, next = old prev = old, next = old
new->next->prev = new prev = old, next = old prev = new, next = new
The desired outcome instead would be to leave both old and new the same
as they were (two empty circular lists). Use list_cut_before, which
already has the necessary check and is documented to discard the
previous contents of the list that will hold the result.
Fixes: 1167986334b9 ("KVM: SEV: Add support for SEV intra host migration") Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123005036.2954379-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:43:11 +0000 (19:43 -0500)]
KVM: x86: Use a stable condition around all VT-d PI paths
Currently, checks for whether VT-d PI can be used refer to the current
status of the feature in the current vCPU; or they more or less pick
vCPU 0 in case a specific vCPU is not available.
However, these checks do not attempt to synchronize with changes to
the IRTE. In particular, there is no path that updates the IRTE when
APICv is re-activated on vCPU 0; and there is no path to wakeup a CPU
that has APICv disabled, if the wakeup occurs because of an IRTE
that points to a posted interrupt.
To fix this, always go through the VT-d PI path as long as there are
assigned devices and APICv is available on both the host and the VM side.
Since the relevant condition was copied over three times, take the hint
and factor it into a separate function.
Suggested-by: Sean Christopherson <seanjc@google.com> Cc: stable@vger.kernel.org Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-5-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:43:10 +0000 (19:43 -0500)]
KVM: x86: check PIR even for vCPUs with disabled APICv
The IRTE for an assigned device can trigger a POSTED_INTR_VECTOR even
if APICv is disabled on the vCPU that receives it. In that case, the
interrupt will just cause a vmexit and leave the ON bit set together
with the PIR bit corresponding to the interrupt.
Right now, the interrupt would not be delivered until APICv is re-enabled.
However, fixing this is just a matter of always doing the PIR->IRR
synchronization, even if the vCPU has temporarily disabled APICv.
This is not a problem for performance, or if anything it is an
improvement. First, in the common case where vcpu->arch.apicv_active is
true, one fewer check has to be performed. Second, static_call_cond will
elide the function call if APICv is not present or disabled. Finally,
in the case for AMD hardware we can remove the sync_pir_to_irr callback:
it is only needed for apic_has_interrupt_for_ppr, and that function
already has a fallback for !APICv.
Cc: stable@vger.kernel.org Co-developed-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com>
Message-Id: <20211123004311.2954158-4-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Tue, 23 Nov 2021 00:43:09 +0000 (19:43 -0500)]
KVM: VMX: prepare sync_pir_to_irr for running with APICv disabled
If APICv is disabled for this vCPU, assigned devices may still attempt to
post interrupts. In that case, we need to cancel the vmentry and deliver
the interrupt with KVM_REQ_EVENT. Extend the existing code that handles
injection of L1 interrupts into L2 to cover this case as well.
vmx_hwapic_irr_update is only called when APICv is active so it would be
confusing to add a check for vcpu->arch.apicv_active in there. Instead,
just use vmx_set_rvi directly in vmx_sync_pir_to_irr.
Cc: stable@vger.kernel.org Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: David Matlack <dmatlack@google.com> Reviewed-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211123004311.2954158-3-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: selftests: page_table_test: fix calculation of guest_test_phys_mem
A kvm_page_table_test run with its default settings fails on VMX due to
memory region add failure:
> ==== Test Assertion Failure ====
> lib/kvm_util.c:952: ret == 0
> pid=10538 tid=10538 errno=17 - File exists
> 1 0x00000000004057d1: vm_userspace_mem_region_add at kvm_util.c:947
> 2 0x0000000000401ee9: pre_init_before_test at kvm_page_table_test.c:302
> 3 (inlined by) run_test at kvm_page_table_test.c:374
> 4 0x0000000000409754: for_each_guest_mode at guest_modes.c:53
> 5 0x0000000000401860: main at kvm_page_table_test.c:500
> 6 0x00007f82ae2d8554: ?? ??:0
> 7 0x0000000000401894: _start at ??:?
> KVM_SET_USER_MEMORY_REGION IOCTL failed,
> rc: -1 errno: 17
> slot: 1 flags: 0x0
> guest_phys_addr: 0xc0000000 size: 0x40000000
This is because the memory range that this test is trying to add
(0x0c0000000 - 0x100000000) conflicts with LAPIC mapping at 0x0fee00000.
Looking at the code it seems that guest_test_*phys*_mem variable gets
mistakenly overwritten with guest_test_*virt*_mem while trying to adjust
the former for alignment.
With the correct variable adjusted this test runs successfully.
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <52e487458c3172923549bbcf9dfccfbe6faea60b.1637940473.git.maciej.szmigiero@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Handle "default" period when selectively waking kthread
Account for the '0' being a default, "let KVM choose" period, when
determining whether or not the recovery worker needs to be awakened in
response to userspace reducing the period. Failure to do so results in
the worker not being awakened properly, e.g. when changing the period
from '0' to any small-ish value.
Fixes: 57fbbb5cb984 ("kvm: x86: mmu: Make NX huge page recovery period configurable") Cc: stable@vger.kernel.org Cc: Junaid Shahid <junaids@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015706.3830341-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Drop the "flush" param and return values to/from the TDP MMU's helper for
zapping collapsible SPTEs. Because the helper runs with mmu_lock held
for read, not write, it uses tdp_mmu_zap_spte_atomic(), and the atomic
zap handles the necessary remote TLB flush.
Similarly, because mmu_lock is dropped and re-acquired between zapping
legacy MMUs and zapping TDP MMUs, kvm_mmu_zap_collapsible_sptes() must
handle remote TLB flushes from the legacy MMU before calling into the TDP
MMU.
Fixes: 08763ffdb3a23 ("KVM: x86/mmu: Skip rmap operations if rmaps not allocated") Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120045046.3940942-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86/mmu: Use yield-safe TDP MMU root iter in MMU notifier unmapping
Use the yield-safe variant of the TDP MMU iterator when handling an
unmapping event from the MMU notifier, as most occurences of the event
allow yielding.
Fixes: 41c263d69428 ("KVM: x86/mmu: Allow yielding during MMU notifier unmap/zap, if possible") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
Message-Id: <20211120015008.3780032-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Wei Yongjun [Mon, 29 Nov 2021 15:16:52 +0000 (15:16 +0000)]
net: mscc: ocelot: fix missing unlock on error in ocelot_hwstamp_set()
Add the missing mutex_unlock before return from function
ocelot_hwstamp_set() in the ocelot_setup_ptp_traps() error
handling case.
Fixes: 53d36c1d30aa ("net: mscc: ocelot: set up traps for PTP packets") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20211129151652.1165433-1-weiyongjun1@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Tue, 30 Nov 2021 04:04:10 +0000 (20:04 -0800)]
Merge tag 'rxrpc-fixes-20211129' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
David Howells says:
====================
rxrpc: Leak fixes
Here are a couple of fixes for leaks in AF_RXRPC:
(1) Fix a leak of rxrpc_peer structs in rxrpc_look_up_bundle().
(2) Fix a leak of rxrpc_local structs in rxrpc_lookup_peer().
* tag 'rxrpc-fixes-20211129' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
rxrpc: Fix rxrpc_local leak in rxrpc_lookup_peer()
rxrpc: Fix rxrpc_peer leak in rxrpc_look_up_bundle()
====================
====================
wireguard/siphash patches for 5.16-rc
Here's quite a largeish set of stable patches I've had queued up and
testing for a number of months now:
- Patch (1) squelches a sparse warning by fixing an annotation.
- Patches (2), (3), and (5) are minor improvements and fixes to the
test suite.
- Patch (4) is part of a tree-wide cleanup to have module-specific
init and exit functions.
- Patch (6) fixes a an issue with dangling dst references, by having a
function to release references immediately rather than deferring,
and adds an associated test case to prevent this from regressing.
- Patches (7) and (8) help mitigate somewhat a potential DoS on the
ingress path due to the use of skb_list's locking hitting contention
on multiple cores by switching to using a ring buffer and dropping
packets on contention rather than locking up another core spinning.
- Patch (9) switches kvzalloc to kvcalloc for better form.
- Patch (10) fixes alignment traps in siphash with clang-13 (and maybe
other compilers) on armv6, by switching to using the unaligned
functions by default instead of the aligned functions by default.
====================
Arnd Bergmann [Mon, 29 Nov 2021 15:39:29 +0000 (10:39 -0500)]
siphash: use _unaligned version by default
On ARM v6 and later, we define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
because the ordinary load/store instructions (ldr, ldrh, ldrb) can
tolerate any misalignment of the memory address. However, load/store
double and load/store multiple instructions (ldrd, ldm) may still only
be used on memory addresses that are 32-bit aligned, and so we have to
use the CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS macro with care, or we
may end up with a severe performance hit due to alignment traps that
require fixups by the kernel. Testing shows that this currently happens
with clang-13 but not gcc-11. In theory, any compiler version can
produce this bug or other problems, as we are dealing with undefined
behavior in C99 even on architectures that support this in hardware,
see also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100363.
Fortunately, the get_unaligned() accessors do the right thing: when
building for ARMv6 or later, the compiler will emit unaligned accesses
using the ordinary load/store instructions (but avoid the ones that
require 32-bit alignment). When building for older ARM, those accessors
will emit the appropriate sequence of ldrb/mov/orr instructions. And on
architectures that can truly tolerate any kind of misalignment, the
get_unaligned() accessors resolve to the leXX_to_cpup accessors that
operate on aligned addresses.
Since the compiler will in fact emit ldrd or ldm instructions when
building this code for ARM v6 or later, the solution is to use the
unaligned accessors unconditionally on architectures where this is
known to be fast. The _aligned version of the hash function is
however still needed to get the best performance on architectures
that cannot do any unaligned access in hardware.
This new version avoids the undefined behavior and should produce
the fastest hash on all architectures we support.
wireguard: ratelimiter: use kvcalloc() instead of kvzalloc()
Use 2-factor argument form kvcalloc() instead of kvzalloc().
Link: https://github.com/KSPP/linux/issues/162 Fixes: 0c73bbc77a76 ("net: WireGuard secure network tunnel") Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
[Jason: Gustavo's link above is for KSPP, but this isn't actually a
security fix, as table_size is bounded to 8192 anyway, and gcc realizes
this, so the codegen comes out to be about the same.] Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard: receive: drop handshakes if queue lock is contended
If we're being delivered packets from multiple CPUs so quickly that the
ring lock is contended for CPU tries, then it's safe to assume that the
queue is near capacity anyway, so just drop the packet rather than
spinning. This helps deal with multicore DoS that can interfere with
data path performance. It _still_ does not completely fix the issue, but
it again chips away at it.
Reported-by: Streun Fabio <fstreun@student.ethz.ch> Fixes: 0c73bbc77a76 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard: receive: use ring buffer for incoming handshakes
Apparently the spinlock on incoming_handshake's skb_queue is highly
contended, and a torrent of handshake or cookie packets can bring the
data plane to its knees, simply by virtue of enqueueing the handshake
packets to be processed asynchronously. So, we try switching this to a
ring buffer to hopefully have less lock contention. This alleviates the
problem somewhat, though it still isn't perfect, so future patches will
have to improve this further. However, it at least doesn't completely
diminish the data plane.
Reported-by: Streun Fabio <fstreun@student.ethz.ch> Reported-by: Joel Wanner <joel.wanner@inf.ethz.ch> Fixes: 0c73bbc77a76 ("net: WireGuard secure network tunnel") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard: device: reset peer src endpoint when netns exits
Each peer's endpoint contains a dst_cache entry that takes a reference
to another netdev. When the containing namespace exits, we take down the
socket and prevent future sockets from being created (by setting
creating_net to NULL), which removes that potential reference on the
netns. However, it doesn't release references to the netns that a netdev
cached in dst_cache might be taking, so the netns still might fail to
exit. Since the socket is gimped anyway, we can simply clear all the
dst_caches (by way of clearing the endpoint src), which will release all
references.
However, the current dst_cache_reset function only releases those
references lazily. But it turns out that all of our usages of
wg_socket_clear_peer_endpoint_src are called from contexts that are not
exactly high-speed or bottle-necked. For example, when there's
connection difficulty, or when userspace is reconfiguring the interface.
And in particular for this patch, when the netns is exiting. So for
those cases, it makes more sense to call dst_release immediately. For
that, we add a small helper function to dst_cache.
This patch also adds a test to netns.sh from Hangbin Liu to ensure this
doesn't regress.
Tested-by: Hangbin Liu <liuhangbin@gmail.com> Reported-by: Xiumei Mu <xmu@redhat.com> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Cc: Paolo Abeni <pabeni@redhat.com> Fixes: eb453e387def ("wireguard: device: avoid circular netns references") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Li Zhijian [Mon, 29 Nov 2021 15:39:24 +0000 (10:39 -0500)]
wireguard: selftests: rename DEBUG_PI_LIST to DEBUG_PLIST
DEBUG_PI_LIST was renamed to DEBUG_PLIST since 793ea1f0c0 ("lib/plist:
rename DEBUG_PI_LIST to DEBUG_PLIST").
Signed-off-by: Li Zhijian <lizhijian@cn.fujitsu.com> Fixes: 793ea1f0c023 ("lib/plist: rename DEBUG_PI_LIST to DEBUG_PLIST") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Randy Dunlap [Mon, 29 Nov 2021 15:39:23 +0000 (10:39 -0500)]
wireguard: main: rename 'mod_init' & 'mod_exit' functions to be module-specific
Rename module_init & module_exit functions that are named
"mod_init" and "mod_exit" so that they are unique in both the
System.map file and in initcall_debug output instead of showing
up as almost anonymous "mod_init".
This is helpful for debugging and in determining how long certain
module_init calls take to execute.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
wireguard: selftests: actually test for routing loops
We previously removed the restriction on looping to self, and then added
a test to make sure the kernel didn't blow up during a routing loop. The
kernel didn't blow up, thankfully, but on certain architectures where
skb fragmentation is easier, such as ppc64, the skbs weren't actually
being discarded after a few rounds through. But the test wasn't catching
this. So actually test explicitly for massive increases in tx to see if
we have a routing loop. Note that the actual loop problem will need to
be addressed in a different commit.
Fixes: c62450e60bca ("wireguard: socket: remove errant restriction on looping to self") Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>