David S. Miller [Fri, 10 Jul 2020 21:02:01 +0000 (14:02 -0700)]
Merge tag 'mlx5-fixes-2020-07-02' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2020-07-02
This series introduces some fixes to mlx5 driver.
V1->v2:
- Drop "ip -s" patch and mirred device hold reference patch.
- Will revise them in a later submission.
Please pull and let me know if there is any problem.
For -stable v5.2
('net/mlx5: Fix eeprom support for SFP module')
For -stable v5.4
('net/mlx5e: Fix 50G per lane indication')
For -stable v5.5
('net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash')
('net/mlx5e: Fix VXLAN configuration restore after function reload')
For -stable v5.7
('net/mlx5e: CT: Fix memory leak in cleanup')
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Britstein [Sun, 28 Jun 2020 12:42:26 +0000 (15:42 +0300)]
net/mlx5e: CT: Fix memory leak in cleanup
CT entries are deleted via a workqueue from netfilter. If removing the
module before that, the rules are cleaned by the driver itself, but the
memory entries for them are not freed. Fix that.
Fixes: d342728dba6a ("net/mlx5e: CT: Offload established flows") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eran Ben Elisha [Mon, 22 Jun 2020 06:03:31 +0000 (09:03 +0300)]
net/mlx5e: Fix port buffers cell size value
Device unit for port buffers size, xoff_threshold and xon_threshold is
cells. Fix a bug in driver where cell unit size was hard-coded to
128 bytes. This hard-coded value is buggy, as it is wrong for some hardware
versions.
Driver to read cell size from SBCAM register and translate bytes to cell
units accordingly.
In order to fix the bug, this patch exposes SBCAM (Shared buffer
capabilities mask) layout and defines.
If SBCAM.cap_cell_size is valid, use it for all bytes to cells
calculations. If not valid, fallback to 128.
Cell size do not change on the fly per device. Instead of issuing SBCAM
access reg command every time such translation is needed, cache it in
mlx5e_dcbx as part of mlx5e_dcbnl_initialize(). Pass dcbx.port_buff_cell_sz
as a param to every function that needs bytes to cells translation.
While fixing the bug, move MLX5E_BUFFER_CELL_SHIFT macro to
en_dcbnl.c, as it is only used by that file.
Fixes: e66b2248cb64 ("net/mlx5e: Receive buffer configuration") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Reviewed-by: Huy Nguyen <huyn@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aya Levin [Mon, 15 Jun 2020 09:48:47 +0000 (12:48 +0300)]
net/mlx5e: Fix 50G per lane indication
Some released FW versions mistakenly don't set the capability that 50G
per lane link-modes are supported for VFs (ptys_extended_ethernet
capability bit). When the capability is unset, read
PTYS.ext_eth_proto_capability (always reliable).
If PTYS.ext_eth_proto_capability is valid (has a non-zero value)
conclude that the HCA supports 50G per lane. Otherwise, conclude that
the HCA doesn't support 50G per lane.
Fixes: 2ea9a63120b5 ("net/mlx5: Add support to ext_* fields introduced in Port Type and Speed register") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Aya Levin [Wed, 20 May 2020 07:37:42 +0000 (10:37 +0300)]
net/mlx5e: Fix CPU mapping after function reload to avoid aRFS RX crash
After function reload, CPU mapping used by aRFS RX is broken, leading to
a kernel panic. Fix by moving initialization of rx_cpu_rmap from
netdev_init to netdev_attach. IRQ table is re-allocated on mlx5_load,
but netdev is not re-initialize.
Aya Levin [Wed, 24 Jun 2020 16:04:03 +0000 (19:04 +0300)]
net/mlx5e: Fix VXLAN configuration restore after function reload
When detaching netdev, remove vxlan port configuration using
udp_tunnel_drop_rx_info. During function reload, configuration will be
restored using udp_tunnel_get_rx_info. This ensures sync between
firmware and driver. Use udp_tunnel_get_rx_info even if its physical
interface is down.
Fixes: 78080f553ea3 ("net/mlx5: Add devlink reload") Signed-off-by: Aya Levin <ayal@mellanox.com> Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Vlad Buslov [Wed, 17 Jun 2020 14:51:53 +0000 (17:51 +0300)]
net/mlx5e: Fix usage of rcu-protected pointer
In mlx5e_configure_flower() flow pointer is protected by rcu read lock.
However, after cited commit the pointer is being used outside of rcu read
block. Extend the block to protect all pointer accesses.
Fixes: 168a1e713885 ("net/mlx5e: Support tc block sharing for representors") Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Vlad Buslov [Wed, 17 Jun 2020 14:26:33 +0000 (17:26 +0300)]
net/mxl5e: Verify that rpriv is not NULL
In helper function is_flow_rule_duplicate_allowed() verify that rpviv
pointer is not NULL before dereferencing it. This can happen when device is
in NIC mode and leads to following crash:
Eran Ben Elisha [Sun, 14 Jun 2020 14:31:26 +0000 (17:31 +0300)]
net/mlx5: Fix eeprom support for SFP module
Fix eeprom SFP query support by setting i2c_addr, offset and page number
correctly. Unlike QSFP modules, SFP eeprom params are as follow:
- i2c_addr is 0x50 for offset 0 - 255 and 0x51 for offset 256 - 511.
- Page number is always zero.
- Page offset is always relative to zero.
As part of eeprom query, query the module ID (SFP / QSFP*) via helper
function to set the params accordingly.
In addition, change mlx5_qsfp_eeprom_page() input type to be u16 to avoid
unnecessary casting.
Fixes: a34061193bc4 ("net/mlx5e: ethtool, Add support for EEPROM high pages query") Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com> Signed-off-by: Huy Nguyen <huyn@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Cong Wang [Thu, 9 Jul 2020 23:28:44 +0000 (16:28 -0700)]
cgroup: Fix sock_cgroup_data on big-endian.
In order for no_refcnt and is_data to be the lowest order two
bits in the 'val' we have to pad out the bitfield of the u8.
Fixes: 39b450b527b3 ("cgroup: fix cgroup_sk_alloc() for sk_clone_lock()") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: make sure listeners don't initialize congestion-control state
syzkaller found its way into setsockopt with TCP_CONGESTION "cdg".
tcp_cdg_init() does a kcalloc to store the gradients. As sk_clone_lock
just copies all the memory, the allocated pointer will be copied as
well, if the app called setsockopt(..., TCP_CONGESTION) on the listener.
If now the socket will be destroyed before the congestion-control
has properly been initialized (through a call to tcp_init_transfer), we
will end up freeing memory that does not belong to that particular
socket, opening the door to a double-free:
Wei Wang fixed a part of these CDG-malloc issues with commit 55043d9bcf92
("tcp: memset ca_priv data to 0 properly").
This patch here fixes the listener-scenario: We make sure that listeners
setting the congestion-control through setsockopt won't initialize it
(thus CDG never allocates on listeners). For those who use AF_UNSPEC to
reuse a socket, tcp_disconnect() is changed to cleanup afterwards.
(The issue can be reproduced at least down to v4.4.x.)
Cc: Wei Wang <weiwan@google.com> Cc: Eric Dumazet <edumazet@google.com> Fixes: c02084a8e83c ("tcp: add CDG congestion control") Signed-off-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Thu, 9 Jul 2020 10:11:50 +0000 (12:11 +0200)]
ethtool: fix genlmsg_put() failure handling in ethnl_default_dumpit()
If the genlmsg_put() call in ethnl_default_dumpit() fails, we bail out
without checking if we already have some messages in current skb like we do
with ethnl_default_dump_one() failure later. Therefore if existing messages
almost fill up the buffer so that there is not enough space even for
netlink and genetlink header, we lose all prepared messages and return and
error.
Rather than duplicating the skb->len check, move the genlmsg_put(),
genlmsg_cancel() and genlmsg_end() calls into ethnl_default_dump_one().
This is also more logical as all message composition will be in
ethnl_default_dump_one() and only iteration logic will be left in
ethnl_default_dumpit().
Fixes: e9533c9ca90b ("ethtool: default handlers for GET requests") Reported-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Thu, 9 Jul 2020 03:13:59 +0000 (20:13 -0700)]
net_sched: fix a memory leak in atm_tc_init()
When tcf_block_get() fails inside atm_tc_init(),
atm_tc_put() is called to release the qdisc p->link.q.
But the flow->ref prevents it to do so, as the flow->ref
is still zero.
Fix this by moving the p->link.ref initialization before
tcf_block_get().
Fixes: 78fe3c434b13 ("net: sched: introduce tcf block infractructure") Reported-and-tested-by: syzbot+d411cff6ab29cc2c311b@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
qed: Populate nvm-file attributes while reading nvm config partition.
NVM config file address will be modified when the MBI image is upgraded.
Driver would return stale config values if user reads the nvm-config
(via ethtool -d) in this state. The fix is to re-populate nvm attribute
info while reading the nvm config values/partition.
Changes from previous version:
-------------------------------
v3: Corrected the formatting in 'Fixes' tag.
v2: Added 'Fixes' tag.
Fixes: 419dd9d0cc6d ("qed: Add configuration information to register dump and debug data") Signed-off-by: Sudarsana Reddy Kalluru <skalluru@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert all-mask IP address to Big Endian, instead, for comparison.
Fixes: ab00315bd45f ("cxgb4: use correct type for all-mask IP address comparison") Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hamish Martin [Wed, 8 Jul 2020 21:06:44 +0000 (09:06 +1200)]
tipc: fix retransmission on unicast links
A scenario has been observed where a 'bc_init' message for a link is not
retransmitted if it fails to be received by the peer. This leads to the
peer never establishing the link fully and it discarding all other data
received on the link. In this scenario the message is lost in transit to
the peer.
The issue is traced to the 'nxt_retr' field of the skb not being
initialised for links that aren't a bc_sndlink. This leads to the
comparison in tipc_link_advance_transmq() that gates whether to attempt
retransmission of a message performing in an undesirable way.
Depending on the relative value of 'jiffies', this comparison:
time_before(jiffies, TIPC_SKB_CB(skb)->nxt_retr)
may return true or false given that 'nxt_retr' remains at the
uninitialised value of 0 for non bc_sndlinks.
This is most noticeable shortly after boot when jiffies is initialised
to a high value (to flush out rollover bugs) and we compare a jiffies of,
say, 4294940189 to zero. In that case time_before returns 'true' leading
to the skb not being retransmitted.
The fix is to ensure that all skbs have a valid 'nxt_retr' time set for
them and this is achieved by refactoring the setting of this value into
a central function.
With this fix, transmission losses of 'bc_init' messages do not stall
the link establishment forever because the 'bc_init' message is
retransmitted and the link eventually establishes correctly.
Fixes: 18d6c7aafa7b ("tipc: reduce duplicate packets for unicast traffic") Acked-by: Jon Maloy <jmaloy@redhat.com> Signed-off-by: Hamish Martin <hamish.martin@alliedtelesis.co.nz> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 6 Jul 2020 18:02:32 +0000 (02:02 +0800)]
l2tp: remove skb_dst_set() from l2tp_xmit_skb()
In the tx path of l2tp, l2tp_xmit_skb() calls skb_dst_set() to set
skb's dst. However, it will eventually call inet6_csk_xmit() or
ip_queue_xmit() where skb's dst will be overwritten by:
skb_dst_set_noref(skb, dst);
without releasing the old dst in skb. Then it causes dst/dev refcnt leak:
unregister_netdevice: waiting for eth0 to become free. Usage count = 1
This can be reproduced by simply running:
# modprobe l2tp_eth && modprobe l2tp_ip
# sh ./tools/testing/selftests/net/l2tp.sh
So before going to inet6_csk_xmit() or ip_queue_xmit(), skb's dst
should be dropped. This patch is to fix it by removing skb_dst_set()
from l2tp_xmit_skb() and moving skb_dst_drop() into l2tp_xmit_core().
Fixes: 72ecdb28cfd7 ("[L2TP]: PPP over L2TP driver core") Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: James Chapman <jchapman@katalix.com> Tested-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 8 Jul 2020 19:35:15 +0000 (12:35 -0700)]
Merge branch 'net-smc-fixes'
Karsten Graul says:
====================
net/smc: fixes 2020-07-08
Please apply the following patch series for smc to netdev's net tree.
The patches fix problems found during more testing of SMC
functionality, resulting in hang conditions and unneeded link
deactivations. The clc module was hardened to be prepared for
possible future SMCD versions.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
CLC proposal messages of future SMCD versions could be larger than SMCD
V1 CLC proposal messages.
To enable toleration in SMC V1 the receival of CLC proposal messages
is adapted:
* accept larger length values in CLC proposal
* check trailing eye catcher for incoming CLC proposal with V1 length only
* receive the whole CLC proposal even in cases it does not fit into the
V1 buffer
Fixes: 2c0ab8bdf9515 ("smc: support variable CLC proposal messages") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The similar smc_ib_devices spinlock has been converted to a mutex.
Protecting the smcd_dev_list by a mutex is possible as well. This
patch converts the smcd_dev_list spinlock to a mutex.
Fixes: aada6413944e ("net/smc: add base infrastructure for SMC-D and ISM") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Function smc_pnet_find_rdma_dev() might be called from
smc_pnet_find_roce_resource(). It holds the smc_ib_devices list
spinlock while calling infiniband op get_netdev(). At least for mlx5
the get_netdev operation wants mutex serialization, which conflicts
with the smc_ib_devices spinlock.
This patch switches the smc_ib_devices spinlock into a mutex to
allow sleeping when calling get_netdev().
Fixes: 0bb7e7b2f4d9 ("smc: introduce SMC as an IB-client") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Wait for pending sends only when smc_switch_conns() found a link to move
the connections to. Do not wait during link freeing, this can lead to
permanent hang situations. And refuse to provide a new tx slot on an
unusable link.
Fixes: 983e309bb3b8 ("net/smc: switch connections to alternate link") Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/smc: separate LLC wait queues for flow and messages
There might be races in scenarios where both SMC link groups are on the
same system. Prevent that by creating separate wait queues for LLC flows
and messages. Switch to non-interruptable versions of wait_event() and
wake_up() for the llc flow waiter to make sure the waiters get control
sequentially. Fine tune the llc_flow_lock to include the assignment of
the message. Write to system log when an unexpected message was
dropped. And remove an extra indirection and use the existing local
variable lgr in smc_llc_enqueue().
Fixes: 96fc5c847c20 ("net/smc: add event-based llc_flow framework") Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: atlantic: fix ip dst and ipv6 address filters
This patch fixes ip dst and ipv6 address filters.
There were 2 mistakes in the code, which led to the issue:
* invalid register was used for ipv4 dst address;
* incorrect write order of dwords for ipv6 addresses.
Fixes: 26a65e6d72ba ("net: aquantia: add rx-flow filter definitions") Signed-off-by: Dmitry Bogdanov <dbogdanov@marvell.com> Signed-off-by: Mark Starovoytov <mstarovoitov@marvell.com> Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Shannon Nelson [Tue, 7 Jul 2020 21:13:26 +0000 (14:13 -0700)]
ionic: centralize queue reset code
The queue reset pattern is used in a couple different places,
only slightly different from each other, and could cause
issues if one gets changed and the other didn't. This puts
them together so that only one version is needed, yet each
can have slighty different effects by passing in a pointer
to a work function to do whatever configuration twiddling is
needed in the middle of the reset.
This specifically addresses issues seen where under loops
of changing ring size or queue count parameters we could
occasionally bump into the netdev watchdog.
v2: added more commit message commentary
Fixes: 47577625b856 ("ionic: Add initial ethtool support") Signed-off-by: Shannon Nelson <snelson@pensando.io> Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
vlan: consolidate VLAN parsing code and limit max parsing depth
Toshiaki pointed out that we now have two very similar functions to extract
the L3 protocol number in the presence of VLAN tags. And Daniel pointed out
that the unbounded parsing loop makes it possible for maliciously crafted
packets to loop through potentially hundreds of tags.
Fix both of these issues by consolidating the two parsing functions and
limiting the VLAN tag parsing to a max depth of 8 tags. As part of this,
switch over __vlan_get_protocol() to use skb_header_pointer() instead of
pskb_may_pull(), to avoid the possible side effects of the latter and keep
the skb pointer 'const' through all the parsing functions.
v2:
- Use limit of 8 tags instead of 32 (matching XMIT_RECURSION_LIMIT)
Reported-by: Toshiaki Makita <toshiaki.makita1@gmail.com> Reported-by: Daniel Borkmann <daniel@iogearbox.net> Fixes: 497fdc5a2504 ("sched: consistently handle layer3 header accesses in the presence of VLANs") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When generating debug dump, driver firstly collects all data in binary
form, and then performs per-feature formatting to human-readable if it
is supported.
For ethtool -d, this is roughly incorrect for two reasons. First of all,
drivers should always provide only original raw dumps to Ethtool without
any changes.
The second, and more critical, is that Ethtool's output buffer size is
strictly determined by ethtool_ops::get_regs_len(), and all data *must*
fit in it. The current version of driver always returns the size of raw
data, but the size of the formatted buffer exceeds it in most cases.
This leads to out-of-bound writes and memory corruption.
Address both issues by adding an option to return original, non-formatted
debug data, and using it for Ethtool case.
v2:
- Expand commit message to make it more clear;
- No functional changes.
Fixes: 364549c6ac9a ("qed: Add support for debug data collection") Signed-off-by: Alexander Lobakin <alobakin@marvell.com> Signed-off-by: Igor Russkikh <irusskikh@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Commit 3039c80113a6 ("net: bridge: mcast: fix stale nsrcs pointer in
igmp3/mld2 report handling") introduced a bug in the IPv6 header payload
length check which would potentially lead to rejecting a valid MLD2 Report:
The check needs to take into account the 2 bytes for the "Number of
Sources" field in the "Multicast Address Record" before reading it.
And not the size of a pointer to this field.
Fixes: 3039c80113a6 ("net: bridge: mcast: fix stale nsrcs pointer in igmp3/mld2 report handling") Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin Varghese [Sun, 5 Jul 2020 08:53:49 +0000 (14:23 +0530)]
net: Added pointer check for dst->ops->neigh_lookup in dst_neigh_lookup_skb
The packets from tunnel devices (eg bareudp) may have only
metadata in the dst pointer of skb. Hence a pointer check of
neigh_lookup is needed in dst_neigh_lookup_skb
Kernel crashes when packets from bareudp device is processed in
the kernel neighbour subsytem.
Fixes: 0d716309e857 ("Fix dst_neigh_lookup/dst_neigh_lookup_skb return value handling bug") Signed-off-by: Martin Varghese <martin.varghese@nokia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
rwlock.h should not be included directly. Instead linux/splinlock.h
should be included. Including it directly will break the RT build.
Fixes: d3f142a8c01c2 ("net/mlx5e: Extract neigh-specific code from en_rep.c to rep/neigh.c") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 3 Jul 2020 16:06:04 +0000 (18:06 +0200)]
mptcp: fix DSS map generation on fin retransmission
The RFC 8684 mandates that no-data DATA FIN packets should carry
a DSS with 0 sequence number and data len equal to 1. Currently,
on FIN retransmission we re-use the existing mapping; if the previous
fin transmission was part of a partially acked data packet, we could
end-up writing in the egress packet a non-compliant DSS.
The above will be detected by a "Bad mapping" warning on the receiver
side.
This change addresses the issue explicitly checking for 0 len packet
when adding the DATA_FIN option.
Fixes: 9689c66bea9f ("mptcp: Write MPTCP DSS headers to outgoing data packets") Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: fill fl4_icmp_{type,code} in ping_v4_sendmsg
IPv4 ping sockets don't set fl4.fl4_icmp_{type,code}, which leads to
incomplete IPsec ACQUIRE messages being sent to userspace. Currently,
both raw sockets and IPv6 ping sockets set those fields.
Expected output of "ip xfrm monitor":
acquire proto esp
sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 8 code 0 dev ens4
policy src 10.0.2.15/32 dst 8.8.8.8/32
<snip>
Currently with ping sockets:
acquire proto esp
sel src 10.0.2.15/32 dst 8.8.8.8/32 proto icmp type 0 code 0 dev ens4
policy src 10.0.2.15/32 dst 8.8.8.8/32
<snip>
The Libreswan test suite found this problem after Fedora changed the
value for the sysctl net.ipv4.ping_group_range.
Fixes: cfce3383c7be ("net: ipv4: add IPPROTO_ICMP socket kind") Reported-by: Paul Wouters <pwouters@redhat.com> Tested-by: Paul Wouters <pwouters@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ethernet: fec: prevent tx starvation under high rx load
In the ISR, we poll the event register for the queues in need of
service and then enter polled mode. After this point, the event
register will never be read again until we exit polled mode.
In a scenario where a UDP flow is routed back out through the same
interface, i.e. "router-on-a-stick" we'll typically only see an rx
queue event initially. Once we start to process the incoming flow
we'll be locked polled mode, but we'll never clean the tx rings since
that event is never caught.
Eventually the netdev watchdog will trip, causing all buffers to be
dropped and then the process starts over again.
Rework the NAPI poll to keep trying to consome the entire budget as
long as new events are coming in, making sure to service all rx/tx
queues, in priority order, on each pass.
Fixes: 8ca69680e0a8 ("net: fec: change data structure to support multiqueue") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Tested-by: Fugang Duan <fugang.duan@nxp.com> Reviewed-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Thu, 2 Jul 2020 18:52:56 +0000 (11:52 -0700)]
cgroup: fix cgroup_sk_alloc() for sk_clone_lock()
When we clone a socket in sk_clone_lock(), its sk_cgrp_data is
copied, so the cgroup refcnt must be taken too. And, unlike the
sk_alloc() path, sock_update_netprioidx() is not called here.
Therefore, it is safe and necessary to grab the cgroup refcnt
even when cgroup_sk_alloc is disabled.
sk_clone_lock() is in BH context anyway, the in_interrupt()
would terminate this function if called there. And for sk_alloc()
skcd->val is always zero. So it's safe to factor out the code
to make it more readable.
The global variable 'cgroup_sk_alloc_disabled' is used to determine
whether to take these reference counts. It is impossible to make
the reference counting correct unless we save this bit of information
in skcd->val. So, add a new bit there to record whether the socket
has already taken the reference counts. This obviously relies on
kmalloc() to align cgroup pointers to at least 4 bytes,
ARCH_KMALLOC_MINALIGN is certainly larger than that.
This bug seems to be introduced since the beginning, commit 386f5e5782dd ("cgroup: duplicate cgroup reference when cloning sockets")
tried to fix it but not compeletely. It seems not easy to trigger until
the recent commit dd17751a691c
("netprio_cgroup: Fix unlimited memory leak of v2 cgroups") was merged.
Fixes: 588b947267eb ("sock, cgroup: add sock->sk_cgroup") Reported-by: Cameron Berkenpas <cam@neo-zeon.de> Reported-by: Peter Geis <pgwipeout@gmail.com> Reported-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com> Reported-by: Daniël Sonck <dsonck92@gmail.com> Reported-by: Zhang Qiang <qiang.zhang@windriver.com> Tested-by: Cameron Berkenpas <cam@neo-zeon.de> Tested-by: Peter Geis <pgwipeout@gmail.com> Tested-by: Thomas Lamprecht <t.lamprecht@proxmox.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Zefan Li <lizefan@huawei.com> Cc: Tejun Heo <tj@kernel.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 7 Jul 2020 13:39:24 +0000 (07:39 -0600)]
ipv6: Fix use of anycast address with loopback
Thomas reported a regression with IPv6 and anycast using the following
reproducer:
echo 1 > /proc/sys/net/ipv6/conf/all/forwarding
ip -6 a add fc12::1/16 dev lo
sleep 2
echo "pinging lo"
ping6 -c 2 fc12::
The conversion of addrconf_f6i_alloc to use ip6_route_info_create missed
the use of fib6_is_reject which checks addresses added to the loopback
interface and sets the REJECT flag as needed. Update fib6_is_reject for
loopback checks to handle RTF_ANYCAST addresses.
Fixes: 157c4223e771 ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create") Reported-by: thomas.gambier@nexedi.com Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Building the kernel with W=1 produces numerous warnings for the IPA
code. Some of those warnings turn out to flag real problems, and
this series fixes them. The first patch fixes the most important
ones, but the second and third are problems I think are worth
treating as bugs as well.
Note: I'll happily combine any of these if someone prefers that.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 6 Jul 2020 23:10:10 +0000 (18:10 -0500)]
net: ipa: include declarations in "ipa_gsi.c"
Include "ipa_gsi.h" in "ipa_gsi.c", so the public functions are
defined before they are used in "ipa_gsi.c". This addresses some
warnings that are reported with a "W=1" build.
Fixes: de8a36e34197 ("soc: qcom: ipa: IPA interface to GSI") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 6 Jul 2020 23:10:09 +0000 (18:10 -0500)]
net: ipa: declare struct types in "ipa_gsi.h"
Pointers to two struct types are used in "ipa_gsi.h", without those
struct types being forward-declared. Add these declarations.
Fixes: de8a36e34197 ("soc: qcom: ipa: IPA interface to GSI") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Mon, 6 Jul 2020 23:10:08 +0000 (18:10 -0500)]
net: ipa: fix QMI structure definition bugs
Building with "W=1" did exactly what it was supposed to do, namely
point out some suspicious-looking code to be verified not to contain
bugs.
Some QMI message structures defined in "ipa_qmi_msg.c" contained
some bad field names (duplicating the "elem_size" field instead of
defining the "offset" field), almost certainly due to copy/paste
errors that weren't obvious in a scan of the code. Fix these bugs.
Fixes: 8b29b90afdb0 ("soc: qcom: ipa: AP/modem communications") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Mon, 6 Jul 2020 17:45:07 +0000 (11:45 -0600)]
ipv6: fib6_select_path can not use out path for nexthop objects
Brian reported a crash in IPv6 code when using rpfilter with a setup
running FRR and external nexthop objects. The root cause of the crash
is fib6_select_path setting fib6_nh in the result to NULL because of
an improper check for nexthop objects.
More specifically, rpfilter invokes ip6_route_lookup with flowi6_oif
set causing fib6_select_path to be called with have_oif_match set.
fib6_select_path has early check on have_oif_match and jumps to the
out label which presumes a builtin fib6_nh. This path is invalid for
nexthop objects; for external nexthops fib6_select_path needs to just
return if the fib6_nh has already been set in the result otherwise it
returns after the call to nexthop_path_fib6_result. Update the check
on have_oif_match to not bail on external nexthops.
Update selftests for this problem.
Fixes: b6b1bd779ad2 ("ipv6: Plumb support for nexthop object in a fib6_info") Reported-by: Brian Rak <brak@choopa.com> Signed-off-by: David Ahern <dsahern@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Yonglong Liu [Mon, 6 Jul 2020 11:26:02 +0000 (19:26 +0800)]
net: hns3: fix use-after-free when doing self test
Enable promisc mode of PF, set VF link state to enable, and
run iperf of the VF, then do self test of the PF. The self test
will fail with a low frequency, and may cause a use-after-free
problem.
The length of packet sent by the selftest process is only
128 + 14 bytes, and the min buffer size of a BD is 256 bytes,
and the receive process will make sure the packet sent by
the selftest process is in the linear part, so only check
the linear part in hns3_lb_check_skb_data().
So fix this use-after-free by using skb_headlen() to dump
skb->data instead of skb->len.
Fixes: b598c1a72a4c ("net: hns3: Add mac loopback selftest support in hns3 driver") Signed-off-by: Yonglong Liu <liuyonglong@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 6 Jul 2020 11:26:01 +0000 (19:26 +0800)]
net: hns3: add a missing uninit debugfs when unload driver
When unloading driver, if flag HNS3_NIC_STATE_INITED has been
already cleared, the debugfs will not be uninitialized, so fix it.
Fixes: 0dca317e261e ("net: hns3: Add debugfs framework registration") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 6 Jul 2020 11:26:00 +0000 (19:26 +0800)]
net: hns3: fix for mishandle of asserting VF reset fail
When asserts VF reset fail, flag HCLGEVF_STATE_CMD_DISABLE
and handshake status should not set, otherwise the retry will
fail. So adds a check for asserting VF reset and returns
directly when fails.
Fixes: f0bb8df120e8 ("net: hns3: stop handling command queue while resetting VF") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 6 Jul 2020 11:25:59 +0000 (19:25 +0800)]
net: hns3: check reset pending after FLR prepare
If there is a PF reset pending before FLR prepare, FLR's
preparatory work will not fail, but the FLR rebuild procedure
will fail for this pending. So this PF reset pending should
be handled in the FLR preparatory.
Fixes: 8228cfcbf96a ("net: hns3: refactor the precedure of PF FLR") Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Andre Edich [Mon, 6 Jul 2020 08:39:35 +0000 (10:39 +0200)]
smsc95xx: avoid memory leak in smsc95xx_bind
In a case where the ID_REV register read is failed, the memory for a
private data structure has to be freed before returning error from the
function smsc95xx_bind.
Fixes: 59b6c9c01bfb6 ("smsc95xx: add wol support for more frame types") Signed-off-by: Andre Edich <andre.edich@microchip.com> Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xie He [Mon, 6 Jul 2020 00:45:21 +0000 (17:45 -0700)]
drivers/net/wan/lapbether: Fixed the value of hard_header_len
When this driver transmits data,
first this driver will remove a pseudo header of 1 byte,
then the lapb module will prepend the LAPB header of 2 or 3 bytes,
then this driver will prepend a length field of 2 bytes,
then the underlying Ethernet device will prepend its own header.
So, the header length required should be:
-1 + 3 + 2 + "the header length needed by the underlying device".
This patch fixes kernel panic when this driver is used with AF_PACKET
SOCK_DGRAM sockets.
Signed-off-by: Xie He <xie.he.0141@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
net: rmnet: fix interface leak for rmnet module
There are two problems in rmnet module that they occur the leak of
a lower interface.
The symptom is the same, which is the leak of a lower interface.
But there are two different real problems.
This patchset is to fix these real problems.
1. Do not allow to have different two modes.
As a lower interface of rmnet, there are two modes that they are VND
and BRIDGE.
One interface can have only one mode.
But in the current rmnet, there is no code to prevent to have
two modes in one lower interface.
So, interface leak occurs.
2. Do not allow to add multiple bridge interfaces.
rmnet can have only two bridge interface.
If an additional bridge interface is tried to be attached,
rmnet should deny it.
But there is no code to do that.
So, interface leak occurs.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: rmnet: do not allow to add multiple bridge interfaces
rmnet can have only two bridge interface.
One of them is a link interface and another one is added by
the master operation.
rmnet interface shouldn't allow adding additional
bridge interfaces by mater operation.
But, there is no code to deny additional interfaces.
So, interface leak occurs.
Test commands:
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link add rmnet0 link dummy0 type rmnet mux_id 1
ip link set dummy1 master rmnet0
ip link set dummy2 master rmnet0
ip link del rmnet0
In the above test command, the dummy0 was attached to rmnet as VND mode.
Then, dummy1 was attached to rmnet0 as BRIDGE mode.
At this point, dummy0 mode is switched from VND to BRIDGE automatically.
Then, dummy2 is attached to rmnet as BRIDGE mode.
At this point, rmnet0 should deny this operation.
But, rmnet0 doesn't deny this.
So that below splat occurs when the rmnet0 interface is deleted.
Fixes: 9f22e4531a2a ("net: rmnet: use upper/lower device infrastructure") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
There are two types of the lower interface of rmnet that are VND
and BRIDGE.
Each lower interface can have only one type either VND or BRIDGE.
But, there is a case, which uses both lower interface types.
Due to this unexpected behavior, lower interface leak occurs.
Test commands:
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add rmnet0 link dummy0 type rmnet mux_id 1
ip link set dummy1 master rmnet0
ip link add rmnet1 link dummy1 type rmnet mux_id 2
ip link del rmnet0
The dummy1 was attached as BRIDGE interface of rmnet0.
Then, it also was attached as VND interface of rmnet1.
This is unexpected behavior and there is no code for handling this case.
So that below splat occurs when the rmnet0 interface is deleted.
Fixes: 9f22e4531a2a ("net: rmnet: use upper/lower device infrastructure") Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hsr: fix interface leak in error path of hsr_dev_finalize()
To release hsr(upper) interface, it should release
its own lower interfaces first.
Then, hsr(upper) interface can be released safely.
In the current code of error path of hsr_dev_finalize(), it releases hsr
interface before releasing a lower interface.
So, a warning occurs, which warns about the leak of lower interfaces.
In order to fix this problem, changing the ordering of the error path of
hsr_dev_finalize() is needed.
Test commands:
ip link add dummy0 type dummy
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
ip link add hsr1 type hsr slave1 dummy2 slave2 dummy0
Fixes: 84ba9354fb6e ("hsr: use upper/lower device infrastructure") Reported-by: syzbot+7f1c020f68dab95aab59@syzkaller.appspotmail.com Signed-off-by: Taehee Yoo <ap420073@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Luo bin [Sat, 4 Jul 2020 07:32:43 +0000 (15:32 +0800)]
hinic: fix sending mailbox timeout in aeq event work
When sending mailbox in the work of aeq event, another aeq event
will be triggered. because the last aeq work is not exited and only
one work can be excuted simultaneously in the same workqueue, mailbox
sending function will return failure of timeout. We create and use
another workqueue to fix this.
Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:15 +0000 (15:41 -0700)]
Documentation: networking: rxrpc: drop doubled word
Drop the doubled word "have".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: David Howells <dhowells@redhat.com> Cc: linux-afs@lists.infradead.org Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:14 +0000 (15:41 -0700)]
Documentation: networking: ipvs-sysctl: drop doubled word
Drop the doubled word "that".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:13 +0000 (15:41 -0700)]
Documentation: networking: ip-sysctl: drop doubled word
Drop the doubled word "that".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:12 +0000 (15:41 -0700)]
Documentation: networking: dsa: drop doubled word
Drop the doubled word "in".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: Andrew Lunn <andrew@lunn.ch> Cc: Vivien Didelot <vivien.didelot@gmail.com> Cc: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:11 +0000 (15:41 -0700)]
Documentation: networking: can_ucan_protocol: drop doubled words
Drop the doubled words "the" and "of".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: Wolfgang Grandegger <wg@grandegger.com> Cc: Marc Kleine-Budde <mkl@pengutronix.de> Cc: linux-can@vger.kernel.org Acked-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:10 +0000 (15:41 -0700)]
Documentation: networking: ax25: drop doubled word
Drop the doubled word "and".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Cc: Ralf Baechle <ralf@linux-mips.org> Cc: linux-hams@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Fri, 3 Jul 2020 22:41:09 +0000 (15:41 -0700)]
Documentation: networking: arcnet: drop doubled word
Drop the doubled word "to".
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: linux-doc@vger.kernel.org Cc: "David S. Miller" <davem@davemloft.net> Cc: Jakub Kicinski <kuba@kernel.org> Cc: netdev@vger.kernel.org Signed-off-by: David S. Miller <davem@davemloft.net>
sched: consistently handle layer3 header accesses in the presence of VLANs
There are a couple of places in net/sched/ that check skb->protocol and act
on the value there. However, in the presence of VLAN tags, the value stored
in skb->protocol can be inconsistent based on whether VLAN acceleration is
enabled. The commit quoted in the Fixes tag below fixed the users of
skb->protocol to use a helper that will always see the VLAN ethertype.
However, most of the callers don't actually handle the VLAN ethertype, but
expect to find the IP header type in the protocol field. This means that
things like changing the ECN field, or parsing diffserv values, stops
working if there's a VLAN tag, or if there are multiple nested VLAN
tags (QinQ).
To fix this, change the helper to take an argument that indicates whether
the caller wants to skip the VLAN tags or not. When skipping VLAN tags, we
make sure to skip all of them, so behaviour is consistent even in QinQ
mode.
To make the helper usable from the ECN code, move it to if_vlan.h instead
of pkt_sched.h.
v3:
- Remove empty lines
- Move vlan variable definitions inside loop in skb_protocol()
- Also use skb_protocol() helper in IP{,6}_ECN_decapsulate() and
bpf_skb_ecn_set_ce()
v2:
- Use eth_type_vlan() helper in skb_protocol()
- Also fix code that reads skb->protocol directly
- Change a couple of 'if/else if' statements to switch constructs to avoid
calling the helper twice
Reported-by: Ilya Ponetayev <i.ponetaev@ndmsystems.com> Fixes: 8590533cfba4 ("net: sched: fix skb->protocol use in case of accelerated vlan path") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Ferre [Thu, 2 Jul 2020 13:00:21 +0000 (15:00 +0200)]
MAINTAINERS: net: macb: add Claudiu as co-maintainer
I would like that Claudiu becomes co-maintainer of the Cadence macb
driver. He's already participating to lots of reviews and enhancements
to this driver and knows the different versions of this controller.
Signed-off-by: Nicolas Ferre <nicolas.ferre@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: set the correct number of ports
The number of ports is incorrectly set to the maximum available for a DSA
switch. Even if the extra ports are not used, this causes some functions
to be called later, like port_disable() and port_stp_state_set(). If the
driver doesn't check the port index, it will end up modifying unknown
registers.
Fixes: 651fe1942f44 ("dsa: add DSA switch driver for Microchip KSZ9477") Signed-off-by: Codrin Ciubotariu <codrin.ciubotariu@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 2 Jul 2020 01:39:33 +0000 (18:39 -0700)]
tcp: md5: allow changing MD5 keys in all socket states
This essentially reverts commit 7b168934eb0b ("tcp: md5: reject TCP_MD5SIG
or TCP_MD5SIG_EXT on established sockets")
Mathieu reported that many vendors BGP implementations can
actually switch TCP MD5 on established flows.
Quoting Mathieu :
Here is a list of a few network vendors along with their behavior
with respect to TCP MD5:
- Cisco: Allows for password to be changed, but within the hold-down
timer (~180 seconds).
- Juniper: When password is initially set on active connection it will
reset, but after that any subsequent password changes no network
resets.
- Nokia: No notes on if they flap the tcp connection or not.
- Ericsson/RedBack: Allows for 2 password (old/new) to co-exist until
both sides are ok with new passwords.
- Meta-Switch: Expects the password to be set before a connection is
attempted, but no further info on whether they reset the TCP
connection on a change.
- Avaya: Disable the neighbor, then set password, then re-enable.
- Zebos: Would normally allow the change when socket connected.
We can revert my prior change because commit a0317a93d91d ("tcp: md5: fix potential
overestimation of TCP option space") removed the leak of 4 kernel bytes to
the wire that was the main reason for my patch.
While doing my investigations, I found a bug when a MD5 key is changed, leading
to these commits that stable teams want to consider before backporting this revert :
Fixes: 7b168934eb0b "tcp: md5: reject TCP_MD5SIG or TCP_MD5SIG_EXT on established sockets" Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Helmut Grohne [Wed, 1 Jul 2020 11:22:20 +0000 (13:22 +0200)]
net: dsa: microchip: enable ksz9893 via i2c in the ksz9477 driver
The KSZ9893 3-Port Gigabit Ethernet Switch can be controlled via SPI,
I²C or MDIO (very limited and not supported by this driver). While there
is already a compatible entry for the SPI bus, it was missing for I²C.
Signed-off-by: Helmut Grohne <helmut.grohne@intenta.de> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 30 Jun 2020 20:51:28 +0000 (13:51 -0700)]
tcp: fix SO_RCVLOWAT possible hangs under high mem pressure
Whenever tcp_try_rmem_schedule() returns an error, we are under
trouble and should make sure to wakeup readers so that they
can drain socket queues and eventually make room.
Fixes: 83cc0c6df61d ("tcp: avoid extra wakeups for SO_RCVLOWAT users") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When no full socket is available, skbs are sent over a per-netns
control socket. Its sk_mark is temporarily adjusted to match that
of the real (request or timewait) socket or to reflect an incoming
skb, so that the outgoing skb inherits this in __ip_make_skb.
Introduction of the socket cookie mark field broke this. Now the
skb is set through the cookie and cork:
<caller> # init sockc.mark from sk_mark or cmsg
ip_append_data
ip_setup_cork # convert sockc.mark to cork mark
ip_push_pending_frames
ip_finish_skb
__ip_make_skb # set skb->mark to cork mark
But I missed these special control sockets. Update all callers of
__ip(6)_make_skb that were originally missed.
For IPv6, the same two icmp(v6) paths are affected. The third
case is not, as commit 89b9d37ad133 ("tcp: don't annotate
mark on control socket from tcp_v6_send_response()") replaced
the ctl_sk->sk_mark with passing the mark field directly as a
function argument. That commit predates the commit that
introduced the bug.
Fixes: 736982ede330 ("ip: support SO_MARK cmsg") Signed-off-by: Willem de Bruijn <willemb@google.com> Reported-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 1 Jul 2020 19:41:23 +0000 (12:41 -0700)]
tcp: md5: do not send silly options in SYNCOOKIES
Whenever cookie_init_timestamp() has been used to encode
ECN,SACK,WSCALE options, we can not remove the TS option in the SYNACK.
Otherwise, tcp_synack_options() will still advertize options like WSCALE
that we can not deduce later when receiving the packet from the client
to complete 3WHS.
Note that modern linux TCP stacks wont use MD5+TS+SACK in a SYN packet,
but we can not know for sure that all TCP stacks have the same logic.
Before the fix a tcpdump would exhibit this wrong exchange :
10:12:15.464591 IP C > S: Flags [S], seq 4202415601, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 456965269 ecr 0,nop,wscale 8], length 0
10:12:15.464602 IP S > C: Flags [S.], seq 253516766, ack 4202415602, win 65535, options [nop,nop,md5 valid,mss 1400,nop,nop,sackOK,nop,wscale 8], length 0
10:12:15.464611 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid], length 0
10:12:15.464678 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid], length 12
10:12:15.464685 IP S > C: Flags [.], ack 13, win 65535, options [nop,nop,md5 valid], length 0
After this patch the exchange looks saner :
11:59:59.882990 IP C > S: Flags [S], seq 517075944, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508483 ecr 0,nop,wscale 8], length 0
11:59:59.883002 IP S > C: Flags [S.], seq 1902939253, ack 517075945, win 65535, options [nop,nop,md5 valid,mss 1400,sackOK,TS val 1751508479 ecr 1751508483,nop,wscale 8], length 0
11:59:59.883012 IP C > S: Flags [.], ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 0
11:59:59.883114 IP C > S: Flags [P.], seq 1:13, ack 1, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508479], length 12
11:59:59.883122 IP S > C: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508483 ecr 1751508483], length 0
11:59:59.883152 IP S > C: Flags [P.], seq 1:13, ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508483], length 12
11:59:59.883170 IP C > S: Flags [.], ack 13, win 256, options [nop,nop,md5 valid,nop,nop,TS val 1751508484 ecr 1751508484], length 0
Of course, no SACK block will ever be added later, but nothing should break.
Technically, we could remove the 4 nops included in MD5+TS options,
but again some stacks could break seeing not conventional alignment.
Fixes: 8a6be7dd82d0 ("TCPCT part 1g: Responder Cookie => Initiator") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
rds: If one path needs re-connection, check all and re-connect
In testing with mprds enabled, Oracle Cluster nodes after reboot were
not able to communicate with others nodes and so failed to rejoin
the cluster. Peers with lower IP address initiated connection but the
node could not respond as it choose a different path and could not
initiate a connection as it had a higher IP address.
With this patch, when a node sends out a packet and the selected path
is down, all other paths are also checked and any down paths are
re-connected.
Reviewed-by: Ka-cheong Poon <ka-cheong.poon@oracle.com> Reviewed-by: David Edmondson <david.edmondson@oracle.com> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com> Signed-off-by: Rao Shoaib <rao.shoaib@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
My prior fix went a bit too far, according to Herbert and Mathieu.
Since we accept that concurrent TCP MD5 lookups might see inconsistent
keys, we can use READ_ONCE()/WRITE_ONCE() instead of smp_rmb()/smp_wmb()
Clearing all key->key[] is needed to avoid possible KMSAN reports,
if key->keylen is increased. Since tcp_md5_do_add() is not fast path,
using __GFP_ZERO to clear all struct tcp_md5sig_key is simpler.
data_race() was added in linux-5.8 and will prevent KCSAN reports,
this can safely be removed in stable backports, if data_race() is
not yet backported.
v2: use data_race() both in tcp_md5_hash_key() and tcp_md5_do_add()
Fixes: c284be44de78 ("tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Marco Elver <elver@google.com> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Tranchetti [Tue, 30 Jun 2020 17:50:17 +0000 (11:50 -0600)]
genetlink: remove genl_bind
A potential deadlock can occur during registering or unregistering a
new generic netlink family between the main nl_table_lock and the
cb_lock where each thread wants the lock held by the other, as
demonstrated below.
1) Thread 1 is performing a netlink_bind() operation on a socket. As part
of this call, it will call netlink_lock_table(), incrementing the
nl_table_users count to 1.
2) Thread 2 is registering (or unregistering) a genl_family via the
genl_(un)register_family() API. The cb_lock semaphore will be taken for
writing.
3) Thread 1 will call genl_bind() as part of the bind operation to handle
subscribing to GENL multicast groups at the request of the user. It will
attempt to take the cb_lock semaphore for reading, but it will fail and
be scheduled away, waiting for Thread 2 to finish the write.
4) Thread 2 will call netlink_table_grab() during the (un)registration
call. However, as Thread 1 has incremented nl_table_users, it will not
be able to proceed, and both threads will be stuck waiting for the
other.
genl_bind() is a noop, unless a genl_family implements the mcast_bind()
function to handle setting up family-specific multicast operations. Since
no one in-tree uses this functionality as Cong pointed out, simply removing
the genl_bind() function will remove the possibility for deadlock, as there
is no attempt by Thread 1 above to take the cb_lock semaphore.
Fixes: 92681daad901 ("genetlink: pass multicast bind/unbind to families") Suggested-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Johannes Berg <johannes.berg@intel.com> Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Luo bin [Wed, 1 Jul 2020 03:16:33 +0000 (11:16 +0800)]
hinic: fix passing non negative value to ERR_PTR
get_dev_cap and set_resources_state functions may return a positive
value because of hardware failure, and the positive return value
can not be passed to ERR_PTR directly.
Fixes: 4c5b2b71b830 ("hinic: add sriov feature support") Signed-off-by: Luo bin <luobin9@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Tue, 30 Jun 2020 11:46:15 +0000 (14:46 +0300)]
net: qrtr: Fix an out of bounds read qrtr_endpoint_post()
This code assumes that the user passed in enough data for a
qrtr_hdr_v1 or qrtr_hdr_v2 struct, but it's not necessarily true. If
the buffer is too small then it will read beyond the end.
Reported-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reported-by: syzbot+b8fe393f999a291a9ea6@syzkaller.appspotmail.com Fixes: 27121d1ab57e ("net: qrtr: Support decoding incoming v2 packets") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 30 Jun 2020 23:41:01 +0000 (16:41 -0700)]
tcp: md5: add missing memory barriers in tcp_md5_do_add()/tcp_md5_hash_key()
MD5 keys are read with RCU protection, and tcp_md5_do_add()
might update in-place a prior key.
Normally, typical RCU updates would allocate a new piece
of memory. In this case only key->key and key->keylen might
be updated, and we do not care if an incoming packet could
see the old key, the new one, or some intermediate value,
since changing the key on a live flow is known to be problematic
anyway.
We only want to make sure that in the case key->keylen
is changed, cpus in tcp_md5_hash_key() wont try to use
uninitialized data, or crash because key->keylen was
read twice to feed sg_init_one() and ahash_request_set_crypt()
Fixes: bbe5b0dda39b ("tcp: md5: check md5 signature without socket lock") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Carl Huang [Tue, 30 Jun 2020 06:52:51 +0000 (14:52 +0800)]
net: qrtr: free flow in __qrtr_node_release
The flow is allocated in qrtr_tx_wait, but not freed when qrtr node
is released. (*slot) becomes NULL after radix_tree_iter_delete is
called in __qrtr_node_release. The fix is to save (*slot) to a
vairable and then free it.
This memory leak is catched when kmemleak is enabled in kernel,
the report looks like below:
Li Heng [Mon, 29 Jun 2020 10:49:51 +0000 (18:49 +0800)]
net: cxgb4: fix return error value in t4_prep_fw
t4_prep_fw goto bye tag with positive return value when something
bad happened and which can not free resource in adap_init0.
so fix it to return negative value.
Fixes: d4ae5f6f3830 ("cxgb4: Add new scheme to update T4/T5 firmware") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: Li Heng <liheng40@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net* tree.
We've added 28 non-merge commits during the last 9 day(s) which contain
a total of 35 files changed, 486 insertions(+), 232 deletions(-).
The main changes are:
1) Fix an incorrect verifier branch elimination for PTR_TO_BTF_ID pointer
types, from Yonghong Song.
2) Fix UAPI for sockmap and flow_dissector progs that were ignoring various
arguments passed to BPF_PROG_{ATTACH,DETACH}, from Lorenz Bauer & Jakub Sitnicki.
3) Fix broken AF_XDP DMA hacks that are poking into dma-direct and swiotlb
internals and integrate it properly into DMA core, from Christoph Hellwig.
4) Fix RCU splat from recent changes to avoid skipping ingress policy when
kTLS is enabled, from John Fastabend.
5) Fix BPF ringbuf map to enforce size to be the power of 2 in order for its
position masking to work, from Andrii Nakryiko.
6) Fix regression from CAP_BPF work to re-allow CAP_SYS_ADMIN for loading
of network programs, from Maciej Żenczykowski.
7) Fix libbpf section name prefix for devmap progs, from Jesper Dangaard Brouer.
8) Fix formatting in UAPI documentation for BPF helpers, from Quentin Monnet.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song [Tue, 30 Jun 2020 17:12:41 +0000 (10:12 -0700)]
bpf: Add tests for PTR_TO_BTF_ID vs. null comparison
Add two tests for PTR_TO_BTF_ID vs. null ptr comparison,
one for PTR_TO_BTF_ID in the ctx structure and the
other for PTR_TO_BTF_ID after one level pointer chasing.
In both cases, the test ensures condition is not
removed.
For example, for this test
struct bpf_fentry_test_t {
struct bpf_fentry_test_t *a;
};
int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
{
if (arg == 0)
test7_result = 1;
return 0;
}
Before the previous verifier change, we have xlated codes:
int test7(long long unsigned int * ctx):
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
0: (79) r1 = *(u64 *)(r1 +0)
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
1: (b4) w0 = 0
2: (95) exit
After the previous verifier change, we have:
int test7(long long unsigned int * ctx):
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
0: (79) r1 = *(u64 *)(r1 +0)
; if (arg == 0)
1: (55) if r1 != 0x0 goto pc+4
; test7_result = 1;
2: (18) r1 = map[id:6][0]+48
4: (b7) r2 = 1
5: (7b) *(u64 *)(r1 +0) = r2
; int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
6: (b4) w0 = 0
7: (95) exit
Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Andrii Nakryiko <andriin@fb.com> Link: https://lore.kernel.org/bpf/20200630171241.2523875-1-yhs@fb.com
Yonghong Song [Tue, 30 Jun 2020 17:12:40 +0000 (10:12 -0700)]
bpf: Fix an incorrect branch elimination by verifier
Wenbo reported an issue in [1] where a checking of null
pointer is evaluated as always false. In this particular
case, the program type is tp_btf and the pointer to
compare is a PTR_TO_BTF_ID.
The current verifier considers PTR_TO_BTF_ID always
reprents a non-null pointer, hence all PTR_TO_BTF_ID compares
to 0 will be evaluated as always not-equal, which resulted
in the branch elimination.
For example,
struct bpf_fentry_test_t {
struct bpf_fentry_test_t *a;
};
int BPF_PROG(test7, struct bpf_fentry_test_t *arg)
{
if (arg == 0)
test7_result = 1;
return 0;
}
int BPF_PROG(test8, struct bpf_fentry_test_t *arg)
{
if (arg->a == 0)
test8_result = 1;
return 0;
}
In above bpf programs, both branch arg == 0 and arg->a == 0
are removed. This may not be what developer expected.
The bug is introduced by Commit 6784c7410601 ("bpf: Verifier
track null pointer branch_taken with JNE and JEQ"),
where PTR_TO_BTF_ID is considered to be non-null when evaluting
pointer vs. scalar comparison. This may be added
considering we have PTR_TO_BTF_ID_OR_NULL in the verifier
as well.
PTR_TO_BTF_ID_OR_NULL is added to explicitly requires
a non-NULL testing in selective cases. The current generic
pointer tracing framework in verifier always
assigns PTR_TO_BTF_ID so users does not need to
check NULL pointer at every pointer level like a->b->c->d.
We may not want to assign every PTR_TO_BTF_ID as
PTR_TO_BTF_ID_OR_NULL as this will require a null test
before pointer dereference which may cause inconvenience
for developers. But we could avoid branch elimination
to preserve original code intention.
This patch simply removed PTR_TO_BTD_ID from reg_type_not_null()
in verifier, which prevented the above branches from being eliminated.
David S. Miller [Tue, 30 Jun 2020 20:10:57 +0000 (13:10 -0700)]
Merge branch 'net-ipa-three-bug-fixes'
Alex Elder says:
====================
net: ipa: three bug fixes
This series contains three bug fixes for the Qualcomm IPA driver.
In practice these bugs are unlikke.y to be harmful, but they do
represent incorrect code.
Version 2 adds "Fixes" tags to two of the patches and fixes a typo
in one (found by checkpatch.pl).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Tue, 30 Jun 2020 12:44:44 +0000 (07:44 -0500)]
net: ipa: introduce ipa_cmd_tag_process()
Create a new function ipa_cmd_tag_process() that simply allocates a
transaction, adds a tag process command to it to clear the hardware
pipeline, and commits the transaction.
Call it in from ipa_endpoint_suspend(), after suspending the modem
endpoints but before suspending the AP command TX and AP LAN RX
endpoints (which are used by the tag sequence).
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Tue, 30 Jun 2020 12:44:43 +0000 (07:44 -0500)]
net: ipa: no checksum offload for SDM845 LAN RX
The AP LAN RX endpoint should not have download checksum offload
enabled.
The receive handler does properly accommodate the trailer that's
added by the hardware, but we ignore it.
Fixes: 5b985feb23da ("soc: qcom: ipa: configuration data") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alex Elder [Tue, 30 Jun 2020 12:44:42 +0000 (07:44 -0500)]
net: ipa: always check for stopped channel
In gsi_channel_stop(), there's a check to see if the channel might
have entered STOPPED state since a previous call, which might have
timed out before stopping completed.
That check actually belongs in gsi_channel_stop_command(), which is
called repeatedly by gsi_channel_stop() for RX channels.
Fixes: 31210ccab1a3 ("soc: qcom: ipa: the generic software interface") Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Russell King [Tue, 30 Jun 2020 10:04:40 +0000 (11:04 +0100)]
net: mvneta: fix use of state->speed
When support for short preambles was added, it incorrectly keyed its
decision off state->speed instead of state->interface. state->speed
is not guaranteed to be correct for in-band modes, which can lead to
short preambles being unexpectedly disabled.
Fix this by keying off the interface mode, which is the only way that
mvneta can operate at 2.5Gbps.
Fixes: cad86d4d1581 ("net: mvneta: Add support for 2500Mbps SGMII") Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
support AF_PACKET for layer 3 devices
Hans reported that packets injected by a correct-looking and trivial
libpcap-based program were not being accepted by wireguard. In
investigating that, I noticed that a few devices weren't properly
handling AF_PACKET-injected packets, and so this series introduces a bit
of shared infrastructure to support that.
The basic problem begins with socket(AF_PACKET, SOCK_RAW,
htons(ETH_P_ALL)) sockets. When sendto is called, AF_PACKET examines the
headers of the packet with this logic:
The middle condition there triggers, and we jump to
dev_parse_header_protocol. Note that this is the only caller of
dev_parse_header_protocol in the kernel, and I assume it was designed
for this purpose:
if (!dev->header_ops || !dev->header_ops->parse_protocol)
return 0;
return dev->header_ops->parse_protocol(skb);
}
Since AF_PACKET already knows which netdev the packet is going to, the
dev_parse_header_protocol function can see if that netdev has a way it
prefers to figure out the protocol from the header. This, again, is the
only use of parse_protocol in the kernel. At the moment, it's only used
with ethernet devices, via eth_header_parse_protocol. This makes sense,
as mostly people are used to AF_PACKET-injecting ethernet frames rather
than layer 3 frames. But with nothing in place for layer 3 netdevs, this
function winds up returning 0, and skb->protocol then is set to 0, and
then by the time it hits the netdev's ndo_start_xmit, the driver doesn't
know what to do with it.
This is a problem because drivers very much rely on skb->protocol being
correct, and routinely reject packets where it's incorrect. That's why
having this parsing happen for injected packets is quite important. In
wireguard, ipip, and ipip6, for example, packets from AF_PACKET are just
dropped entirely. For tun devices, it's sort of uglier, with the tun
"packet information" header being passed to userspace containing a bogus
protocol value. Some userspace programs are ill-equipped to deal with
that. (But of course, that doesn't happen with tap devices, which
benefit from the similar shared infrastructure for layer 2 netdevs,
further motiviating this patchset for layer 3 netdevs.)
This patchset addresses the issue by first adding a layer 3 header parse
function, much akin to the existing one for layer 2 packets, and then
adds a shared header_ops structure that, also much akin to the existing
one for layer 2 packets. Then it wires it up to a few immediate places
that stuck out as requiring it, and does a bit of cleanup.
This patchset seems like it's fixing real bugs, so it might be
appropriate for stable. But they're also very old bugs, so if you'd
rather not backport to stable, that'd make sense to me too.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
net: xfrmi: implement header_ops->parse_protocol for AF_PACKET
The xfrm interface uses skb->protocol to determine packet type, and
bails out if it's not set. For AF_PACKET injection, we need to support
its call chain of:
Without a valid parse_protocol, this returns zero, and xfrmi rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Without a valid parse_protocol, this returns zero, and sit rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Without a valid parse_protocol, this returns zero, and the tun driver
then gives userspace bogus values that it can't deal with.
Note that this isn't the case with tap, because tap already benefits
from the shared infrastructure for ethernet headers. But with tun,
there's nothing.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
wireguard: queueing: make use of ip_tunnel_parse_protocol
Now that wg_examine_packet_protocol has been added for general
consumption as ip_tunnel_parse_protocol, it's possible to remove
wg_examine_packet_protocol and simply use the new
ip_tunnel_parse_protocol function directly.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
wireguard: implement header_ops->parse_protocol for AF_PACKET
WireGuard uses skb->protocol to determine packet type, and bails out if
it's not set or set to something it's not expecting. For AF_PACKET
injection, we need to support its call chain of:
Without a valid parse_protocol, this returns zero, and wireguard then
rejects the skb. So, this wires up the ip_tunnel handler for layer 3
packets for that case.
Reported-by: Hans Wippel <ndev@hwipl.net> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Without a valid parse_protocol, this returns zero, and ipip rejects the
skb. So, this wires up the ip_tunnel handler for layer 3 packets for
that case.
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>