In event of failure during register_netdevice, free_netdev is
invoked immediately. free_netdev assumes that all the netdevice
refcounts have been dropped prior to it being called and as a
result frees and clears out the refcount pointer.
However, this is not necessarily true as some of the operations
in the NETDEV_UNREGISTER notifier handlers queue RCU callbacks for
invocation after a grace period. The IPv4 callback in_dev_rcu_put
tries to access the refcount after free_netdev is called which
leads to a null de-reference-
Fix this by waiting for the completion of the call_rcu() in
case of register_netdevice errors.
Fixes: 19dbffef8f48 ("[NET]: Fix free_netdev on register_netdev failure.") Cc: Sean Tranchetti <stranche@codeaurora.org> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: Fix the link time qualifier of 'ping_v6_proc_exit_net()'
The '.exit' functions from 'pernet_operations' structure should be marked
as __net_exit, not __net_init.
Fixes: 05e023d3e730 ("net: ipv6: Implement /proc/net/icmp6.") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
tun_chr_read_iter() accessed the memory which freed by free_netdev()
called by tun_set_iff():
CPUA CPUB
tun_set_iff()
alloc_netdev_mqs()
tun_attach()
tun_chr_read_iter()
tun_get()
tun_do_read()
tun_ring_recv()
register_netdevice() <-- inject error
goto err_detach
tun_detach_all() <-- set RCV_SHUTDOWN
free_netdev() <-- called from
err_free_dev path
netdev_freemem() <-- free the memory
without check refcount
(In this path, the refcount cannot prevent
freeing the memory of dev, and the memory
will be used by dev_put() called by
tun_chr_read_iter() on CPUB.)
(Break from tun_ring_recv(),
because RCV_SHUTDOWN is set)
tun_put()
dev_put() <-- use the memory
freed by netdev_freemem()
Put the publishing of tfile->tun after register_netdevice(),
so tun_get() won't get the tun pointer that freed by
err_detach path if register_netdevice() failed.
Fixes: e3f1e44aba97 ("tuntap: attach queue 0 before registering netdevice") Reported-by: Hulk Robot <hulkci@huawei.com> Suggested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 11 Sep 2019 23:05:52 +0000 (00:05 +0100)]
Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
Jeff Kirsher says:
====================
Intel Wired LAN Driver Updates 2019-09-11
This series contains fixes to ixgbe.
Alex fixes up the adaptive ITR scheme for ixgbe which could result in a
value that was either 0 or something less than 10 which was causing
issues with hardware features, like RSC, that do not function well with
ITR values that low.
Ilya Maximets fixes the ixgbe driver to limit the number of transmit
descriptors to clean by the number of transmit descriptors used in the
transmit ring, so that the driver does not try to "double" clean the
same descriptors.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
tcp: fix tcp_ecn_withdraw_cwr() to clear TCP_ECN_QUEUE_CWR
Fix tcp_ecn_withdraw_cwr() to clear the correct bit:
TCP_ECN_QUEUE_CWR.
Rationale: basically, TCP_ECN_DEMAND_CWR is a bit that is purely about
the behavior of data receivers, and deciding whether to reflect
incoming IP ECN CE marks as outgoing TCP th->ece marks. The
TCP_ECN_QUEUE_CWR bit is purely about the behavior of data senders,
and deciding whether to send CWR. The tcp_ecn_withdraw_cwr() function
is only called from tcp_undo_cwnd_reduction() by data senders during
an undo, so it should zero the sender-side state,
TCP_ECN_QUEUE_CWR. It does not make sense to stop the reflection of
incoming CE bits on incoming data packets just because outgoing
packets were spuriously retransmitted.
The bug has been reproduced with packetdrill to manifest in a scenario
with RFC3168 ECN, with an incoming data packet with CE bit set and
carrying a TCP timestamp value that causes cwnd undo. Before this fix,
the IP CE bit was ignored and not reflected in the TCP ECE header bit,
and sender sent a TCP CWR ('W') bit on the next outgoing data packet,
even though the cwnd reduction had been undone. After this fix, the
sender properly reflects the CE bit and does not set the W bit.
Note: the bug actually predates 2005 git history; this Fixes footer is
chosen to be the oldest SHA1 I have tested (from Sep 2007) for which
the patch applies cleanly (since before this commit the code was in a
.h file).
Fixes: 35c47e4f71f8 ("[TCP]: Move code from tcp_ecn.h to tcp*.c and tcp.h & remove it") Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilya Maximets [Thu, 22 Aug 2019 17:12:37 +0000 (20:12 +0300)]
ixgbe: fix double clean of Tx descriptors with xdp
Tx code doesn't clear the descriptors' status after cleaning.
So, if the budget is larger than number of used elems in a ring, some
descriptors will be accounted twice and xsk_umem_complete_tx will move
prod_tail far beyond the prod_head breaking the completion queue ring.
Fix that by limiting the number of descriptors to clean by the number
of used descriptors in the Tx ring.
'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
'ixgbe_xsk_clean_tx_ring()' since we're allowed to directly use
'next_to_clean' and 'next_to_use' indexes.
CC: stable@vger.kernel.org Fixes: 4963a253a7e9 ("ixgbe: add AF_XDP zero-copy Tx support") Signed-off-by: Ilya Maximets <i.maximets@samsung.com> Tested-by: William Tu <u9012063@gmail.com> Tested-by: Eelco Chaudron <echaudro@redhat.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Wed, 4 Sep 2019 15:07:11 +0000 (08:07 -0700)]
ixgbe: Prevent u8 wrapping of ITR value to something less than 10us
There were a couple cases where the ITR value generated via the adaptive
ITR scheme could exceed 126. This resulted in the value becoming either 0
or something less than 10. Switching back and forth between a value less
than 10 and a value greater than 10 can cause issues as certain hardware
features such as RSC to not function well when the ITR value has dropped
that low.
CC: stable@vger.kernel.org Fixes: 14f7dfd61995 ("ixgbe: Update adaptive ITR algorithm") Reported-by: Gregg Leventhal <gleventhal@janestreet.com> Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
net/rds: An rds_sock is added too early to the hash table
In rds_bind(), an rds_sock is added to the RDS bind hash table before
rs_transport is set. This means that the socket can be found by the
receive code path when rs_transport is NULL. And the receive code
path de-references rs_transport for congestion update check. This can
cause a panic. An rds_sock should not be added to the bind hash table
before all the needed fields are set.
Reported-by: syzbot+4b4f8163c2e246df3c4c@syzkaller.appspotmail.com Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
mac80211: Do not send Layer 2 Update frame before authorization
The Layer 2 Update frame is used to update bridges when a station roams
to another AP even if that STA does not transmit any frames after the
reassociation. This behavior was described in IEEE Std 802.11F-2003 as
something that would happen based on MLME-ASSOCIATE.indication, i.e.,
before completing 4-way handshake. However, this IEEE trial-use
recommended practice document was published before RSN (IEEE Std
802.11i-2004) and as such, did not consider RSN use cases. Furthermore,
IEEE Std 802.11F-2003 was withdrawn in 2006 and as such, has not been
maintained amd should not be used anymore.
Sending out the Layer 2 Update frame immediately after association is
fine for open networks (and also when using SAE, FT protocol, or FILS
authentication when the station is actually authenticated by the time
association completes). However, it is not appropriate for cases where
RSN is used with PSK or EAP authentication since the station is actually
fully authenticated only once the 4-way handshake completes after
authentication and attackers might be able to use the unauthenticated
triggering of Layer 2 Update frame transmission to disrupt bridge
behavior.
Fix this by postponing transmission of the Layer 2 Update frame from
station entry addition to the point when the station entry is marked
authorized. Similarly, send out the VLAN binding update only if the STA
entry has already been authorized.
Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Reviewed-by: Johannes Berg <johannes@sipsolutions.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Randy Dunlap [Mon, 9 Sep 2019 21:54:21 +0000 (14:54 -0700)]
lib/Kconfig: fix OBJAGG in lib/ menu structure
Keep the "Library routines" menu intact by moving OBJAGG into it.
Otherwise OBJAGG is displayed/presented as an orphan in the
various config menus.
Fixes: 3216638d528d ("lib: introduce initial implementation of object aggregation manager") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Ido Schimmel <idosch@mellanox.com> Cc: David S. Miller <davem@davemloft.net> Tested-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mao Wenan [Wed, 11 Sep 2019 01:36:23 +0000 (09:36 +0800)]
net: sonic: replace dev_kfree_skb in sonic_send_packet
sonic_send_packet will be processed in irq or non-irq
context, so it would better use dev_kfree_skb_any
instead of dev_kfree_skb.
Fixes: f635c5c0bf2f ("*sonic/natsemi/ns83829: Move the National Semi-conductor drivers") Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 9 Sep 2019 07:33:29 +0000 (15:33 +0800)]
sctp: fix the missing put_user when dumping transport thresholds
This issue causes SCTP_PEER_ADDR_THLDS sockopt not to be able to dump
a transport thresholds info.
Fix it by adding 'goto' put_user in sctp_getsockopt_paddr_thresholds.
Fixes: ba4662b8002b ("sctp: add SCTP_FUTURE_ASSOC for SCTP_PEER_ADDR_THLDS sockopt") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cong Wang [Sun, 8 Sep 2019 19:11:23 +0000 (12:11 -0700)]
net_sched: check cops->tcf_block in tc_bind_tclass()
At least sch_red and sch_tbf don't implement ->tcf_block()
while still have a non-zero tc "class".
Instead of adding nop implementations to each of such qdisc's,
we can just relax the check of cops->tcf_block() in
tc_bind_tclass(). They don't support TC filter anyway.
Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Nicolas Dichtel [Fri, 6 Sep 2019 09:47:02 +0000 (11:47 +0200)]
bridge/mdb: remove wrong use of NLM_F_MULTI
NLM_F_MULTI must be used only when a NLMSG_DONE message is sent at the end.
In fact, NLMSG_DONE is sent only at the end of a dump.
Libraries like libnl will wait forever for NLMSG_DONE.
Fixes: d1718ff886ae ("bridge: mdb: notify on router port add and del") CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Suchanek [Mon, 9 Sep 2019 20:44:51 +0000 (22:44 +0200)]
net/ibmvnic: Fix missing { in __ibmvnic_reset
Commit 2ff0c5c4bb1a ("net/ibmvnic: free reset work of removed device from queue")
adds a } without corresponding { causing build break.
Fixes: 2ff0c5c4bb1a ("net/ibmvnic: free reset work of removed device from queue") Signed-off-by: Michal Suchanek <msuchanek@suse.de> Reviewed-by: Tyrel Datwyler <tyreld@linux.ibm.com> Reviewed-by: Juliet Kim <julietk@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fred Lotter [Fri, 6 Sep 2019 17:29:41 +0000 (19:29 +0200)]
nfp: flower: cmsg rtnl locks can timeout reify messages
Flower control message replies are handled in different locations. The truly
high priority replies are handled in the BH (tasklet) context, while the
remaining replies are handled in a predefined Linux work queue. The work
queue handler orders replies into high and low priority groups, and always
start servicing the high priority replies within the received batch first.
Reply Type: Rtnl Lock: Handler:
CMSG_TYPE_PORT_MOD no BH tasklet (mtu)
CMSG_TYPE_TUN_NEIGH no BH tasklet
CMSG_TYPE_FLOW_STATS no BH tasklet
CMSG_TYPE_PORT_REIFY no WQ high
CMSG_TYPE_PORT_MOD yes WQ high (link/mtu)
CMSG_TYPE_MERGE_HINT yes WQ low
CMSG_TYPE_NO_NEIGH no WQ low
CMSG_TYPE_ACTIVE_TUNS no WQ low
CMSG_TYPE_QOS_STATS no WQ low
CMSG_TYPE_LAG_CONFIG no WQ low
A subset of control messages can block waiting for an rtnl lock (from both
work queue priority groups). The rtnl lock is heavily contended for by
external processes such as systemd-udevd, systemd-network and libvirtd,
especially during netdev creation, such as when flower VFs and representors
are instantiated.
Kernel netlink instrumentation shows that external processes (such as
systemd-udevd) often use successive rtnl_trylock() sequences, which can result
in an rtnl_lock() blocked control message to starve for longer periods of time
during rtnl lock contention, i.e. netdev creation.
In the current design a single blocked control message will block the entire
work queue (both priorities), and introduce a latency which is
nondeterministic and dependent on system wide rtnl lock usage.
In some extreme cases, one blocked control message at exactly the wrong time,
just before the maximum number of VFs are instantiated, can block the work
queue for long enough to prevent VF representor REIFY replies from getting
handled in time for the 40ms timeout.
The firmware will deliver the total maximum number of REIFY message replies in
around 300us.
Only REIFY and MTU update messages require replies within a timeout period (of
40ms). The MTU-only updates are already done directly in the BH (tasklet)
handler.
Move the REIFY handler down into the BH (tasklet) in order to resolve timeouts
caused by a blocked work queue waiting on rtnl locks.
Signed-off-by: Fred Lotter <frederik.lotter@netronome.com> Signed-off-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list
Historically, support for frag_list packets entering skb_segment() was
limited to frag_list members terminating on exact same gso_size
boundaries. This is verified with a BUG_ON since commit 4520824f83bc
("net: Add frag_list support to skb_segment"), quote:
As such we require all frag_list members terminate on exact MSS
boundaries. This is checked using BUG_ON.
As there should only be one producer in the kernel of such packets,
namely GRO, this requirement should not be difficult to maintain.
However, since commit d9d6a8701ab0 ("bpf: add bpf_skb_change_proto helper"),
the "exact MSS boundaries" assumption no longer holds:
An eBPF program using bpf_skb_change_proto() DOES modify 'gso_size', but
leaves the frag_list members as originally merged by GRO with the
original 'gso_size'. Example of such programs are bpf-based NAT46 or
NAT64.
This lead to a kernel BUG_ON for flows involving:
- GRO generating a frag_list skb
- bpf program performing bpf_skb_change_proto() or bpf_skb_adjust_room()
- skb_segment() of the skb
See example BUG_ON reports in [0].
In commit 3061de5806fc ("net: permit skb_segment on head_frag frag_list skb"),
skb_segment() was modified to support the "gso_size mangling" case of
a frag_list GRO'ed skb, but *only* for frag_list members having
head_frag==true (having a page-fragment head).
Alas, GRO packets having frag_list members with a linear kmalloced head
(head_frag==false) still hit the BUG_ON.
This commit adds support to skb_segment() for a 'head_skb' packet having
a frag_list whose members are *non* head_frag, with gso_size mangled, by
disabling SG and thus falling-back to copying the data from the given
'head_skb' into the generated segmented skbs - as suggested by Willem de
Bruijn [1].
Since this approach involves the penalty of skb_copy_and_csum_bits()
when building the segments, care was taken in order to enable this
solution only when required:
- untrusted gso_size, by testing SKB_GSO_DODGY is set
(SKB_GSO_DODGY is set by any gso_size mangling functions in
net/core/filter.c)
- the frag_list is non empty, its item is a non head_frag, *and* the
headlen of the given 'head_skb' does not match the gso_size.
Fixes: d9d6a8701ab0 ("bpf: add bpf_skb_change_proto helper") Suggested-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: addrconf_f6i_alloc - fix non-null pointer check to !IS_ERR()
Fixes a stupid bug I recently introduced...
ip6_route_info_create() returns an ERR_PTR(err) and not a NULL on error.
Fixes: bc0155842fce ("net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)'") Cc: David Ahern <dsahern@gmail.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The problem is that capi_write() is reading past the end of the message.
Fix it by checking the message's length in the needed places.
Reported-and-tested-by: syzbot+0849c524d9c634f5ae66@syzkaller.appspotmail.com Signed-off-by: Eric Biggers <ebiggers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Juliet Kim [Thu, 5 Sep 2019 21:30:01 +0000 (17:30 -0400)]
net/ibmvnic: free reset work of removed device from queue
Commit 529a9a94b51e ("ibmvnic: Do not process reset during or after
device removal") made the change to exit reset if the driver has been
removed, but does not free reset work items of the adapter from queue.
Ensure all reset work items are freed when breaking out of the loop early.
Fixes: 529a9a94b51e ("ibmnvic: Do not process reset during or after device removal”) Signed-off-by: Juliet Kim <julietk@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Chulski [Thu, 5 Sep 2019 16:46:18 +0000 (19:46 +0300)]
net: phylink: Fix flow control resolution
Regarding to IEEE 802.3-2015 standard section 2
28B.3 Priority resolution - Table 28-3 - Pause resolution
In case of Local device Pause=1 AsymDir=0, Link partner
Pause=1 AsymDir=1, Local device resolution should be enable PAUSE
transmit, disable PAUSE receive.
And in case of Local device Pause=1 AsymDir=1, Link partner
Pause=1 AsymDir=0, Local device resolution should be enable PAUSE
receive, disable PAUSE transmit.
Fixes: d3eead77c792 ("phylink: add phylink infrastructure") Signed-off-by: Stefan Chulski <stefanc@marvell.com> Reported-by: Shaul Ben-Mayor <shaulb@marvell.com> Acked-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 6 Sep 2019 13:21:44 +0000 (15:21 +0200)]
Merge tag 'wireless-drivers-for-davem-2019-09-05' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers
Kalle Valo says:
====================
wireless-drivers fixes for 5.3
Fourth set of fixes for 5.3, and hopefully really the last one. Quite
a few CVE fixes this time but at least to my knowledge none of them
have a known exploit.
mt76
* workaround firmware hang by disabling hardware encryption on MT7630E
* disable 5GHz band for MT7630E as it's not working properly
mwifiex
* fix IE parsing to avoid a heap buffer overflow
iwlwifi
* fix for QuZ device initialisation
rt2x00
* another fix for rekeying
* revert a commit causing degradation in rx signal levels
rsi
* fix a double free
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
MAINTAINERS: add myself as maintainer for xilinx axiethernet driver
I am maintaining xilinx axiethernet driver in xilinx tree and would like
to maintain it in the mainline kernel as well. Hence adding myself as a
maintainer. Also Anirudha and John has moved to new roles, so based on
request removing them from the maintainer list.
Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com> Acked-by: John Linn <john.linn@xilinx.com> Acked-by: Michal Simek <michal.simek@xilinx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 5 Sep 2019 12:20:22 +0000 (05:20 -0700)]
net: sched: fix reordering issues
Whenever MQ is not used on a multiqueue device, we experience
serious reordering problems. Bisection found the cited
commit.
The issue can be described this way :
- A single qdisc hierarchy is shared by all transmit queues.
(eg : tc qdisc replace dev eth0 root fq_codel)
- When/if try_bulk_dequeue_skb_slow() dequeues a packet targetting
a different transmit queue than the one used to build a packet train,
we stop building the current list and save the 'bad' skb (P1) in a
special queue. (bad_txq)
- When dequeue_skb() calls qdisc_dequeue_skb_bad_txq() and finds this
skb (P1), it checks if the associated transmit queues is still in frozen
state. If the queue is still blocked (by BQL or NIC tx ring full),
we leave the skb in bad_txq and return NULL.
- dequeue_skb() calls q->dequeue() to get another packet (P2)
The other packet can target the problematic queue (that we found
in frozen state for the bad_txq packet), but another cpu just ran
TX completion and made room in the txq that is now ready to accept
new packets.
- Packet P2 is sent while P1 is still held in bad_txq, P1 might be sent
at next round. In practice P2 is the lead of a big packet train
(P2,P3,P4 ...) filling the BQL budget and delaying P1 by many packets :/
To solve this problem, we have to block the dequeue process as long
as the first packet in bad_txq can not be sent. Reordering issues
disappear and no side effects have been seen.
Fixes: b16e81465636 ("net: sched: explicit locking in gso_cpu fallback") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: John Fastabend <john.fastabend@gmail.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
1) Several xfrm interface fixes from Nicolas Dichtel:
- Avoid an interface ID corruption on changelink.
- Fix wrong intterface names in the logs.
- Fix a list corruption when changing network namespaces.
- Fix unregistation of the underying phydev.
2) Fix a potential warning when merging xfrm_plocy nodes.
From Florian Westphal.
Please pull or let me know if there are problems.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
forcedeth: use per cpu to collect xmit/recv statistics
When testing with a background iperf pushing 1Gbit/sec traffic and running
both ifconfig and netstat to collect statistics, some deadlocks occurred.
Ifconfig and netstat will call nv_get_stats64 to get software xmit/recv
statistics. In the commit 5f1d6b455b07 ("forcedeth: implement
ndo_get_stats64() API"), the normal tx/rx variables is to collect tx/rx
statistics. The fix is to replace normal tx/rx variables with per
cpu 64-bit variable to collect xmit/recv statistics. The per cpu variable
will avoid deadlocks and provide fast efficient statistics updates.
In nv_probe, the per cpu variable is initialized. In nv_remove, this
per cpu variable is freed.
In xmit/recv process, this per cpu variable will be updated.
In nv_get_stats64, this per cpu variable on each cpu is added up. Then
the driver can get xmit/recv packets statistics.
A test runs for several days with this commit, the deadlocks disappear
and the performance is better.
Tested:
- iperf SMP x86_64 ->
Client connecting to 1.1.1.108, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 1.1.1.105 port 38888 connected with 1.1.1.108 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 1.10 GBytes 943 Mbits/sec
Fixes: 5f1d6b455b07 ("forcedeth: implement ndo_get_stats64() API") CC: Joe Jin <joe.jin@oracle.com> CC: JUNXIAO_BI <junxiao.bi@oracle.com> Reported-and-tested-by: Nan san <nan.1986san@gmail.com> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Mao Wenan [Thu, 5 Sep 2019 01:57:12 +0000 (09:57 +0800)]
net: sonic: return NETDEV_TX_OK if failed to map buffer
NETDEV_TX_BUSY really should only be used by drivers that call
netif_tx_stop_queue() at the wrong moment. If dma_map_single() is
failed to map tx DMA buffer, it might trigger an infinite loop.
This patch use NETDEV_TX_OK instead of NETDEV_TX_BUSY, and change
printk to pr_err_ratelimited.
Fixes: f635c5c0bf2f ("*sonic/natsemi/ns83829: Move the National Semi-conductor drivers") Signed-off-by: Mao Wenan <maowenan@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The problem can be seen in the following two tests:
0: (bf) r3 = r10
1: (55) if r3 != 0x7b goto pc+0
2: (7a) *(u64 *)(r3 -8) = 0
3: (79) r4 = *(u64 *)(r10 -8)
..
0: (85) call bpf_get_prandom_u32#7
1: (bf) r3 = r10
2: (55) if r3 != 0x7b goto pc+0
3: (7b) *(u64 *)(r3 -8) = r0
4: (79) r4 = *(u64 *)(r10 -8)
When backtracking need to mark R4 it will mark slot fp-8.
But ST or STX into fp-8 could belong to the same block of instructions.
When backtracing is done the parent state may have fp-8 slot
as "unallocated stack". Which will cause verifier to warn
and incorrectly reject such programs.
Writes into stack via non-R10 register are rare. llvm always
generates canonical stack spill/fill.
For such pathological case fall back to conservative precision
tracking instead of rejecting.
Donald Sharp [Wed, 4 Sep 2019 14:11:58 +0000 (10:11 -0400)]
net: Properly update v4 routes with v6 nexthop
When creating a v4 route that uses a v6 nexthop from a nexthop group.
Allow the kernel to properly send the nexthop as v6 via the RTA_VIA
attribute.
Broken behavior:
$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$
Fixed behavior:
$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$
v2, v3: Addresses code review comments from David Ahern
Fixes: 5828a9bc7f2e (“ipv4: Prepare for fib6_nh from a nexthop object”) Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 3 Sep 2019 22:22:13 +0000 (15:22 -0700)]
selftest: A few cleanups for fib_nexthops.sh
Cleanups of the tests in fib_nexthops.sh
1. Several tests noted unexpected route output, but the
discrepancy was not showing in the summary output and
overlooked in the verbose output. Add a WARNING message
to the summary output to make it clear a test is not showing
expected output.
2. Several check_* calls are missing extra data like scope and metric
causing mismatches when the nexthops or routes are correct - some of
them are a side effect of the evolving iproute2 command. Update the
data to the expected output.
3. Several check_routes are checking for the wrong nexthop data,
most likely a copy-paste-update error.
4. A couple of tests were re-using a nexthop id that already existed.
Fix those to use a new id.
Fixes: bf0523ebf230 ("selftests: Add test cases for nexthop objects") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Tue, 3 Sep 2019 22:22:12 +0000 (15:22 -0700)]
ipv6: Fix RTA_MULTIPATH with nexthop objects
A change to the core nla helpers was missed during the push of
the nexthop changes. rt6_fill_node_nexthop should be calling
nla_nest_start_noflag not nla_nest_start. Currently, iproute2
does not print multipath data because of parsing issues with
the attribute.
Fixes: fe5a5e47e048 ("ipv6: Plumb support for nexthop object in a fib6_info") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
John Fastabend [Tue, 3 Sep 2019 20:24:50 +0000 (13:24 -0700)]
net: sock_map, fix missing ulp check in sock hash case
sock_map and ULP only work together when ULP is loaded after the sock
map is loaded. In the sock_map case we added a check for this to fail
the load if ULP is already set. However, we missed the check on the
sock_hash side.
Add a ULP check to the sock_hash update path.
Fixes: eaf66302e4568 ("bpf, sockmap: convert to generic sk_msg interface") Reported-by: syzbot+7a6ee4d0078eac6bf782@syzkaller.appspotmail.com Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Moritz Fischer [Tue, 3 Sep 2019 18:46:52 +0000 (11:46 -0700)]
net: fixed_phy: Add forward declaration for struct gpio_desc;
Add forward declaration for struct gpio_desc in order to address
the following:
./include/linux/phy_fixed.h:48:17: error: 'struct gpio_desc' declared inside parameter list [-Werror]
./include/linux/phy_fixed.h:48:17: error: its scope is only this definition or declaration, which is probably not what you want [-Werror]
Fixes: 9d0943536a0d ("net: fixed-phy: Add fixed_phy_register_with_gpiod() API") Signed-off-by: Moritz Fischer <mdf@kernel.org> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: e5a0a991bd26 ("tipc: convert name table read-write lock to RCU") Reported-by: Li Shuang <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 5 Sep 2019 06:31:53 +0000 (08:31 +0200)]
Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:
====================
pull request: bluetooth 2019-09-05
Here are a few more Bluetooth fixes for 5.3. I hope they can still make
it. There's one USB ID addition for btusb, two reverts due to discovered
regressions, and two other important fixes.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
There are devices which require low connection intervals for usable operation
including keyboards and mice. Forcing a static connection interval for
these types of devices has an impact in latency and causes a regression.
Signed-off-by: Marcel Holtmann <marcel@holtmann.org> Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
net-ipv6: fix excessive RTF_ADDRCONF flag on ::1/128 local route (and others)
There is a subtle change in behaviour introduced by:
commit 2145117734cbf5e3bec67adadd6ca3fb5f31d7ca
'ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create'
ie. the above commit causes the ::1/128 local (automatic) route to be flagged with RTF_ADDRCONF (0x040000).
AFAICT, this is incorrect since these routes are *not* coming from RA's.
As such, this patch restores the old behaviour.
Fixes: 2145117734cb ("ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create") Cc: David Ahern <dsahern@gmail.com> Cc: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Mon, 2 Sep 2019 15:24:21 +0000 (23:24 +0800)]
sctp: use transport pf_retrans in sctp_do_8_2_transport_strike
Transport should use its own pf_retrans to do the error_count
check, instead of asoc's. Otherwise, it's meaningless to make
pf_retrans per transport.
Fixes: ecdfca015bbb ("sctp: Implement quick failover draft from tsvwg") Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Mon, 2 Sep 2019 10:34:08 +0000 (11:34 +0100)]
rxrpc: Fix misplaced traceline
There's a misplaced traceline in rxrpc_input_packet() which is looking at a
packet that just got released rather than the replacement packet.
Fix this by moving the traceline after the assignment that moves the new
packet pointer to the actual packet pointer.
Fixes: 9f1a735c15df ("rxrpc: Use skb_unshare() rather than skb_cow_data()") Reported-by: Hillf Danton <hdanton@sina.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Bluetooth: hci_qca: disable irqs when spinlock is acquired
Looks like Deadlock is observed in hci_qca while performing
stress and stability tests. Since same lock is getting
acquired from qca_wq_awake_rx and hci_ibs_tx_idle_timeout
seeing spinlock recursion, irqs should be disable while
acquiring the spinlock always.
Leonardo Bras [Fri, 30 Aug 2019 18:13:53 +0000 (15:13 -0300)]
netfilter: nft_fib_netdev: Terminate rule eval if protocol=IPv6 and ipv6 module is disabled
If IPv6 is disabled on boot (ipv6.disable=1), but nft_fib_inet ends up
dealing with a IPv6 packet, it causes a kernel panic in
fib6_node_lookup_1(), crashing in bad_page_fault.
The panic is caused by trying to deference a very low address (0x38
in ppc64le), due to ipv6.fib6_main_tbl = NULL.
BUG: Kernel NULL pointer dereference at 0x00000038
The kernel panic was reproduced in a host that disabled IPv6 on boot and
have to process guest packets (coming from a bridge) using it's ip6tables.
Terminate rule evaluation when packet protocol is IPv6 but the ipv6 module
is not loaded.
Signed-off-by: Leonardo Bras <leonardo@linux.ibm.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Hui Peng [Mon, 19 Aug 2019 22:02:29 +0000 (18:02 -0400)]
rsi: fix a double free bug in rsi_91x_deinit()
`dev` (struct rsi_91x_usbdev *) field of adapter
(struct rsi_91x_usbdev *) is allocated and initialized in
`rsi_init_usb_interface`. If any error is detected in information
read from the device side, `rsi_init_usb_interface` will be
freed. However, in the higher level error handling code in
`rsi_probe`, if error is detected, `rsi_91x_deinit` is called
again, in which `dev` will be freed again, resulting double free.
This patch fixes the double free by removing the free operation on
`dev` in `rsi_init_usb_interface`, because `rsi_91x_deinit` is also
used in `rsi_disconnect`, in that code path, the `dev` field is not
(and thus needs to be) freed.
This bug was found in v4.19, but is also present in the latest version
of kernel. Fixes CVE-2019-15504.
Reported-by: Hui Peng <benquike@gmail.com> Reported-by: Mathias Payer <mathias.payer@nebelwelt.net> Signed-off-by: Hui Peng <benquike@gmail.com> Reviewed-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
"I got some problem after upgrade kernel to 5.2 version (debian testing
linux-image-5.2.0-2-amd64). 5Ghz client stopped to see AP.
Some tests with 1metre distance between client-AP: 2.4Ghz -22dBm, for
5Ghz - 53dBm !, for longer distance (8m + walls) 2.4 - 61dBm, 5Ghz not
visible."
It was identified that rx signal level degradation was caused by d0d1da04dcf2 ("rt2800: enable TX_PIN_CFG_LNA_PE_ bits per band").
So revert this commit.
Cc: <stable@vger.kernel.org> # v5.1+ Reported-and-tested-by: Sergey Maranchuk <slav0nic0@gmail.com> Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
After looking at code I realized that my previous fix 8baaeacb8443 ("rt2x00: clear IV's on start to fix AP mode regression")
was incomplete. We can still have wrong IV's after re-keyring.
To fix that, clear up IV's also on key removal.
Fixes: f41636d69ba5 ("rt2800: do not nullify initialization vector data") Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
tested-by: Emil Karlson <jekarl@iki.fi> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Luca Coelho [Thu, 29 Aug 2019 08:13:46 +0000 (11:13 +0300)]
iwlwifi: assign directly to iwl_trans->cfg in QuZ detection
We were erroneously assigning the new configuration to a local
variable cfg, but that was not being assigned to anything, so the
change was getting lost. Assign directly to iwl_trans->cfg instead.
Fixes: 3f8960b64df8 ("iwlwifi: pcie: fix recognition of QuZ devices") Cc: stable@vger.kernel.org # 5.2 Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Wen Huang [Wed, 28 Aug 2019 02:07:51 +0000 (10:07 +0800)]
mwifiex: Fix three heap overflow at parsing element in cfg80211_ap_settings
mwifiex_update_vs_ie(),mwifiex_set_uap_rates() and
mwifiex_set_wmm_params() call memcpy() without checking
the destination size.Since the source is given from
user-space, this may trigger a heap buffer overflow.
Fix them by putting the length check before performing memcpy().
This fix addresses CVE-2019-14814,CVE-2019-14815,CVE-2019-14816.
Signed-off-by: Wen Huang <huangwenabc@gmail.com> Acked-by: Ganapathi Bhat <gbhat@marvell.comg> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
MT7630E hardware does support 5GHz, but we do not properly configure phy
for 5GHz channels. Scanning at this band not only do not show any APs
but also can hang the firmware.
Since vendor reference driver do not support 5GHz we don't know how
properly configure 5GHz channels. So disable this band for MT7630E .
Cc: stable@vger.kernel.org Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
mt76: mt76x0e: don't use hw encryption for MT7630E
Since 5e543acb41fa ("mt76: only schedule txqs from the tx tasklet")
I can observe firmware hangs on MT7630E on station mode: tx stop
functioning after minor activity (rx keep working) and on module
unload device fail to stop with messages:
[ 5446.141413] mt76x0e 0000:06:00.0: TX DMA did not stop
[ 5449.176764] mt76x0e 0000:06:00.0: TX DMA did not stop
Loading module again results in failure to associate with AP.
Only machine power off / power on cycle can make device work again.
It's unclear why commit 5e543acb41fa causes the problem, but it is
related to HW encryption. Since issue is a firmware hang, that is super
hard to debug, just disable HW encryption as fix for the issue.
Fixes: 5e543acb41fa ("mt76: only schedule txqs from the tx tasklet") Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Yizhuo [Sat, 31 Aug 2019 02:00:48 +0000 (19:00 -0700)]
net: stmmac: dwmac-sun8i: Variable "val" in function sun8i_dwmac_set_syscon() could be uninitialized
In function sun8i_dwmac_set_syscon(), local variable "val" could
be uninitialized if function regmap_field_read() returns -EINVAL.
However, it will be used directly in the if statement, which
is potentially unsafe.
Signed-off-by: Yizhuo <yzhai003@ucr.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge tag 'char-misc-5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char and misc driver fixes for reported issues for
5.3-rc7
Also included in here is the documentation for how we are handling
hardware issues under embargo that everyone has finally agreed on, as
well as a MAINTAINERS update for the suckers who agreed to handle the
LICENSES/ files.
All of these have been in linux-next last week with no reported
issues"
* tag 'char-misc-5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
fsi: scom: Don't abort operations for minor errors
vmw_balloon: Fix offline page marking with compaction
VMCI: Release resource if the work is already queued
Documentation/process: Embargoed hardware security issues
lkdtm/bugs: fix build error in lkdtm_EXHAUST_STACK
mei: me: add Tiger Lake point LP device ID
intel_th: pci: Add Tiger Lake support
intel_th: pci: Add support for another Lewisburg PCH
stm class: Fix a double free of stm_source_device
MAINTAINERS: add entry for LICENSES and SPDX stuff
fpga: altera-ps-spi: Fix getting of optional confd gpio
Merge tag 'usb-5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB fixes that have been in linux-next this past
week for 5.3-rc7
They fix the usual xhci, syzbot reports, and other small issues that
have come up last week.
All have been in linux-next with no reported issues"
* tag 'usb-5.3-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
USB: cdc-wdm: fix race between write and disconnect due to flag abuse
usb: host: xhci: rcar: Fix typo in compatible string matching
usb: host: xhci-tegra: Set DMA mask correctly
USB: storage: ums-realtek: Whitelist auto-delink support
USB: storage: ums-realtek: Update module parameter description for auto_delink_en
usb: host: ohci: fix a race condition between shutdown and irq
usb: hcd: use managed device resources
typec: tcpm: fix a typo in the comparison of pdo_max_voltage
usb-storage: Add new JMS567 revision to unusual_devs
usb: chipidea: udc: don't do hardware access if gadget has stopped
usbtmc: more sanity checking for packet size
usb: udc: lpc32xx: silence fall-through warning
1) Fix some length checks during OGM processing in batman-adv, from
Sven Eckelmann.
2) Fix regression that caused netfilter conntrack sysctls to not be
per-netns any more. From Florian Westphal.
3) Use after free in netpoll, from Feng Sun.
4) Guard destruction of pfifo_fast per-cpu qdisc stats with
qdisc_is_percpu_stats(), from Davide Caratti. Similar bug is fixed
in pfifo_fast_enqueue().
5) Fix memory leak in mld_del_delrec(), from Eric Dumazet.
6) Handle neigh events on internal ports correctly in nfp, from John
Hurley.
7) Clear SKB timestamp in NF flow table code so that it does not
confuse fq scheduler. From Florian Westphal.
8) taprio destroy can crash if it is invoked in a failure path of
taprio_init(), because the list head isn't setup properly yet and
the list del is unconditional. Perform the list add earlier to
address this. From Vladimir Oltean.
9) Make sure to reapply vlan filters on device up, in aquantia driver.
From Dmitry Bogdanov.
10) sgiseeq driver releases DMA memory using free_page() instead of
dma_free_attrs(). From Christophe JAILLET.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (58 commits)
net: seeq: Fix the function used to release some memory in an error handling path
enetc: Add missing call to 'pci_free_irq_vectors()' in probe and remove functions
net: bcmgenet: use ethtool_op_get_ts_info()
tc-testing: don't hardcode 'ip' in nsPlugin.py
net: dsa: microchip: add KSZ8563 compatibility string
dt-bindings: net: dsa: document additional Microchip KSZ8563 switch
net: aquantia: fix out of memory condition on rx side
net: aquantia: linkstate irq should be oneshot
net: aquantia: reapply vlan filters on up
net: aquantia: fix limit of vlan filters
net: aquantia: fix removal of vlan 0
net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
taprio: Fix kernel panic in taprio_destroy
net: dsa: microchip: fill regmap_config name
rxrpc: Fix lack of conn cleanup when local endpoint is cleaned up [ver #2]
net: stmmac: dwmac-rk: Don't fail if phy regulator is absent
amd-xgbe: Fix error path in xgbe_mod_init()
netfilter: nft_meta_bridge: Fix get NFT_META_BRI_IIFVPROTO in network byteorder
mac80211: Correctly set noencrypt for PAE frames
...
net: seeq: Fix the function used to release some memory in an error handling path
In commit 47330dd6483f ("sgiseeq: replace use of dma_cache_wback_inv"),
a call to 'get_zeroed_page()' has been turned into a call to
'dma_alloc_coherent()'. Only the remove function has been updated to turn
the corresponding 'free_page()' into 'dma_free_attrs()'.
The error hndling path of the probe function has not been updated.
Fix it now.
Rename the corresponding label to something more in line.
Fixes: 47330dd6483f ("sgiseeq: replace use of dma_cache_wback_inv") Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Reviewed-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of fixes for x86:
- Fix the bogus detection of 32bit user mode for uretprobes which
caused corruption of the user return address resulting in
application crashes. In the uprobes handler in_ia32_syscall() is
obviously always returning false on a 64bit kernel. Use
user_64bit_mode() instead which works correctly.
- Prevent large page splitting when ftrace flips RW/RO on the kernel
text which caused iTLB performance issues. Ftrace wants to be
converted to text_poke() which avoids the problem, but for now
allow large page preservation in the static protections check when
the change request spawns a full large page.
- Prevent arch_dynirq_lower_bound() from returning 0 when the IOAPIC
is configured via device tree. In the device tree case the GSI 1:1
mapping is meaningless therefore the lower bound which protects the
GSI range on ACPI machines is irrelevant. Return the lower bound
which the core hands to the function instead of blindly returning 0
which causes the core to allocate the invalid virtual interupt
number 0 which in turn prevents all drivers from allocating and
requesting an interrupt.
- Remove the bogus initialization of LDR and DFR in the 32bit bigsmp
APIC driver. That uses physical destination mode where LDR/DFR are
ignored, but the initialization and the missing clear of LDR caused
the APIC to be left in a inconsistent state on kexec/reboot.
- Clear LDR when clearing the APIC registers so the APIC is in a well
defined state.
- Initialize variables proper in the find_trampoline_placement()
code.
- Silence GCC( build warning for the real mode part of the build"
* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
x86/build: Add -Wnoaddress-of-packed-member to REALMODE_CFLAGS, to silence GCC9 build warning
x86/boot/compressed/64: Fix missing initialization in find_trampoline_placement()
x86/apic: Include the LDR when clearing out APIC registers
x86/apic: Do not initialize LDR and DFR for bigsmp
uprobes/x86: Fix detection of 32-bit user mode
x86/apic: Fix arch_dynirq_lower_bound() bug for DT enabled machines
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf fixes from Thomas Gleixner:
"Two fixes for perf x86 hardware implementations:
- Restrict the period on Nehalem machines to prevent perf from
hogging the CPU
- Prevent the AMD IBS driver from overwriting the hardwre controlled
and pre-seeded reserved bits (0-6) in the count register which
caused a sample bias for dispatched micro-ops"
* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
perf/x86/amd/ibs: Fix sample bias for dispatched micro-ops
perf/x86/intel: Restrict period on Nehalem
Ryan M. Collins [Fri, 30 Aug 2019 18:49:55 +0000 (14:49 -0400)]
net: bcmgenet: use ethtool_op_get_ts_info()
This change enables the use of SW timestamping on the Raspberry Pi 4.
bcmgenet's transmit function bcmgenet_xmit() implements software
timestamping. However the SOF_TIMESTAMPING_TX_SOFTWARE capability was
missing and only SOF_TIMESTAMPING_RX_SOFTWARE was announced. By using
ethtool_ops bcmgenet_ethtool_ops() as get_ts_info(), the
SOF_TIMESTAMPING_TX_SOFTWARE capability is announced.
Similar to commit 5908ea3999fe ("smsc95xx: use ethtool_op_get_ts_info()")
Signed-off-by: Ryan M. Collins <rmc032@bucknell.edu> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Acked-by: Doug Berger <opendmb@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Davide Caratti [Fri, 30 Aug 2019 16:51:47 +0000 (18:51 +0200)]
tc-testing: don't hardcode 'ip' in nsPlugin.py
the following tdc test fails on Fedora:
# ./tdc.py -e 2638
-- ns/SubPlugin.__init__
Test 2638: Add matchall and try to get it
-----> prepare stage *** Could not execute: "$TC qdisc add dev $DEV1 clsact"
-----> prepare stage *** Error message: "/bin/sh: ip: command not found"
returncode 127; expected [0]
-----> prepare stage *** Aborting test run.
Let nsPlugin.py use the 'IP' variable introduced with commit b4f2e5dcc3ca
("tc-tests: added path to ip command in tdc"), so that the path to 'ip' is
correctly resolved to the value we have in tdc_config.py.
# ./tdc.py -e 2638
-- ns/SubPlugin.__init__
Test 2638: Add matchall and try to get it
All test results:
1..1
ok 1 2638 - Add matchall and try to get it
Fixes: 60beea0116ca ("tc-testing: Restore original behaviour for namespaces in tdc") Reported-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bogdanov [Fri, 30 Aug 2019 12:08:38 +0000 (12:08 +0000)]
net: aquantia: fix out of memory condition on rx side
On embedded environments with hard memory limits it is a normal although
rare case when skb can't be allocated on rx part under high traffic.
In such OOM cases napi_complete_done() was not called.
So the napi object became in an invalid state like it is "scheduled".
Kernel do not re-schedules the poll of that napi object.
Consequently, kernel can not remove that object the system hangs on
`ifconfig down` waiting for a poll.
We are fixing this by gracefully closing napi poll routine with correct
invocation of napi_complete_done.
This was reproduced with artificially failing the allocation of skb to
simulate an "out of memory" error case and check that traffic does
not get stuck.
Fixes: 2e7ff1c63234 ("net: ethernet: aquantia: Vector operations") Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Fri, 30 Aug 2019 12:08:36 +0000 (12:08 +0000)]
net: aquantia: linkstate irq should be oneshot
Declaring threaded irq handler should also indicate the irq is
oneshot. It is oneshot indeed, because HW implements irq automasking
on trigger.
Not declaring this causes some kernel configurations to fail
on interface up, because request_threaded_irq returned an err code.
The issue was originally hidden on normal x86_64 configuration with
latest kernel, because depending on interrupt controller, irq driver
added ONESHOT flag on its own.
Issue was observed on older kernels (4.14) where no such logic exists.
Fixes: ccf377d0059c ("net: aquantia: link status irq handling") Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Reported-by: Michael Symolkin <Michael.Symolkin@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bogdanov [Fri, 30 Aug 2019 12:08:35 +0000 (12:08 +0000)]
net: aquantia: reapply vlan filters on up
In case of device reconfiguration the driver may reset the device invisible
for other modules, vlan module in particular. So vlans will not be
removed&created and vlan filters will not be configured in the device.
The patch reapplies the vlan filters at device start.
Fixes: 043044d8cc0ab ("net: aquantia: add support of rx-vlan-filter offload") Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bogdanov [Fri, 30 Aug 2019 12:08:33 +0000 (12:08 +0000)]
net: aquantia: fix limit of vlan filters
Fix a limit condition of vlans on the interface before setting vlan
promiscuous mode
Fixes: b32b23952b941 ("net: aquantia: fix vlans not working over bridged network") Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Bogdanov [Fri, 30 Aug 2019 12:08:30 +0000 (12:08 +0000)]
net: aquantia: fix removal of vlan 0
Due to absence of checking against the rx flow rule when vlan 0 is being
removed, the other rule could be removed instead of the rule with vlan 0
Fixes: 043044d8cc0ab ("net: aquantia: add support of rx-vlan-filter offload") Signed-off-by: Dmitry Bogdanov <dmitry.bogdanov@aquantia.com> Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 30 Aug 2019 01:07:23 +0000 (04:07 +0300)]
net/sched: cbs: Set default link speed to 10 Mbps in cbs_set_port_rate
The discussion to be made is absolutely the same as in the case of
previous patch ("taprio: Set default link speed to 10 Mbps in
taprio_set_picos_per_byte"). Nothing is lost when setting a default.
Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Fixes: 84c2f0b3e8ac ("net/sched: cbs: fix port_rate miscalculation") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 30 Aug 2019 01:07:22 +0000 (04:07 +0300)]
taprio: Set default link speed to 10 Mbps in taprio_set_picos_per_byte
The taprio budget needs to be adapted at runtime according to interface
link speed. But that handling is problematic.
For one thing, installing a qdisc on an interface that doesn't have
carrier is not illegal. But taprio prints the following stack trace:
[ 31.851373] ------------[ cut here ]------------
[ 31.856024] WARNING: CPU: 1 PID: 207 at net/sched/sch_taprio.c:481 taprio_dequeue+0x1a8/0x2d4
[ 31.864566] taprio: dequeue() called with unknown picos per byte.
[ 31.864570] Modules linked in:
[ 31.873701] CPU: 1 PID: 207 Comm: tc Not tainted 5.3.0-rc5-01199-g8838fe023cd6 #1689
[ 31.881398] Hardware name: Freescale LS1021A
[ 31.885661] [<c03133a4>] (unwind_backtrace) from [<c030d8cc>] (show_stack+0x10/0x14)
[ 31.893368] [<c030d8cc>] (show_stack) from [<c10ac958>] (dump_stack+0xb4/0xc8)
[ 31.900555] [<c10ac958>] (dump_stack) from [<c0349d04>] (__warn+0xe0/0xf8)
[ 31.907395] [<c0349d04>] (__warn) from [<c0349d64>] (warn_slowpath_fmt+0x48/0x6c)
[ 31.914841] [<c0349d64>] (warn_slowpath_fmt) from [<c0f38db4>] (taprio_dequeue+0x1a8/0x2d4)
[ 31.923150] [<c0f38db4>] (taprio_dequeue) from [<c0f227b0>] (__qdisc_run+0x90/0x61c)
[ 31.930856] [<c0f227b0>] (__qdisc_run) from [<c0ec82ac>] (net_tx_action+0x12c/0x2bc)
[ 31.938560] [<c0ec82ac>] (net_tx_action) from [<c0302298>] (__do_softirq+0x130/0x3c8)
[ 31.946350] [<c0302298>] (__do_softirq) from [<c03502a0>] (irq_exit+0xbc/0xd8)
[ 31.953536] [<c03502a0>] (irq_exit) from [<c03a4808>] (__handle_domain_irq+0x60/0xb4)
[ 31.961328] [<c03a4808>] (__handle_domain_irq) from [<c0754478>] (gic_handle_irq+0x58/0x9c)
[ 31.969638] [<c0754478>] (gic_handle_irq) from [<c0301a8c>] (__irq_svc+0x6c/0x90)
[ 31.977076] Exception stack(0xe8167b20 to 0xe8167b68)
[ 31.982100] 7b20: e9d4bd8000000cc0000000cf00000000e9d4bd80c1f3895800000cc0c1f38960
[ 31.990234] 7b40: 00000001000000cf00000004e9dc080000000000e8167b70c0f478ecc0f46d94
[ 31.998363] 7b60: 60070013ffffffff
[ 32.001833] [<c0301a8c>] (__irq_svc) from [<c0f46d94>] (netlink_trim+0x18/0xd8)
[ 32.009104] [<c0f46d94>] (netlink_trim) from [<c0f478ec>] (netlink_broadcast_filtered+0x34/0x414)
[ 32.017930] [<c0f478ec>] (netlink_broadcast_filtered) from [<c0f47cec>] (netlink_broadcast+0x20/0x28)
[ 32.027102] [<c0f47cec>] (netlink_broadcast) from [<c0eea378>] (rtnetlink_send+0x34/0x88)
[ 32.035238] [<c0eea378>] (rtnetlink_send) from [<c0f25890>] (notify_and_destroy+0x2c/0x44)
[ 32.043461] [<c0f25890>] (notify_and_destroy) from [<c0f25e08>] (qdisc_graft+0x398/0x470)
[ 32.051595] [<c0f25e08>] (qdisc_graft) from [<c0f27a00>] (tc_modify_qdisc+0x3a4/0x724)
[ 32.059470] [<c0f27a00>] (tc_modify_qdisc) from [<c0ee4c84>] (rtnetlink_rcv_msg+0x260/0x2ec)
[ 32.067864] [<c0ee4c84>] (rtnetlink_rcv_msg) from [<c0f4a988>] (netlink_rcv_skb+0xb8/0x110)
[ 32.076172] [<c0f4a988>] (netlink_rcv_skb) from [<c0f4a170>] (netlink_unicast+0x1b4/0x22c)
[ 32.084392] [<c0f4a170>] (netlink_unicast) from [<c0f4a5e4>] (netlink_sendmsg+0x33c/0x380)
[ 32.092614] [<c0f4a5e4>] (netlink_sendmsg) from [<c0ea9f40>] (sock_sendmsg+0x14/0x24)
[ 32.100403] [<c0ea9f40>] (sock_sendmsg) from [<c0eaa780>] (___sys_sendmsg+0x214/0x228)
[ 32.108279] [<c0eaa780>] (___sys_sendmsg) from [<c0eabad0>] (__sys_sendmsg+0x50/0x8c)
[ 32.116068] [<c0eabad0>] (__sys_sendmsg) from [<c0301000>] (ret_fast_syscall+0x0/0x54)
[ 32.123938] Exception stack(0xe8167fa8 to 0xe8167ff0)
[ 32.128960] 7fa0: b6fa68c8000000f800000003bea142d00000000000000000
[ 32.137093] 7fc0: b6fa68c8000000f80052154c000001285d6468a2000000000000002800558c9c
[ 32.145224] 7fe0: 00000070bea1427800530d64b6e17e64
[ 32.150659] ---[ end trace 2139c9827c3e5177 ]---
This happens because the qdisc ->dequeue callback gets called. Which
again is not illegal, the qdisc will dequeue even when the interface is
up but doesn't have carrier (and hence SPEED_UNKNOWN), and the frames
will be dropped further down the stack in dev_direct_xmit().
And, at the end of the day, for what? For calculating the initial budget
of an interface which is non-operational at the moment and where frames
will get dropped anyway.
So if we can't figure out the link speed, default to SPEED_10 and move
along. We can also remove the runtime check now.
Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Fixes: 3578d8eccf83 ("net/sched: taprio: fix picos_per_byte miscalculation") Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vladimir Oltean [Fri, 30 Aug 2019 01:07:21 +0000 (04:07 +0300)]
taprio: Fix kernel panic in taprio_destroy
taprio_init may fail earlier than this line:
list_add(&q->taprio_list, &taprio_list);
i.e. due to the net device not being multi queue.
Attempting to remove q from the global taprio_list when it is not part
of it will result in a kernel panic.
Fix it by matching list_add and list_del better to one another in the
order of operations. This way we can keep the deletion unconditional
and with lower complexity - O(1).
Cc: Leandro Dorileo <leandro.maciel.dorileo@intel.com> Fixes: 3578d8eccf83 ("net/sched: taprio: fix picos_per_byte miscalculation") Signed-off-by: Vladimir Oltean <olteanv@gmail.com> Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Use the register value width as the regmap_config name to prevent the
following error when the second and third regmap_configs are
initialized.
"debugfs: Directory '${bus-id}' with parent 'regmap' already present!"
Signed-off-by: George McCollister <george.mccollister@gmail.com> Reviewed-by: Marek Vasut <marex@denx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Pu Wen [Sat, 31 Aug 2019 02:20:31 +0000 (10:20 +0800)]
tools/power turbostat: Add support for Hygon Fam 18h (Dhyana) RAPL
Commit 3975338c4deee573d0a2 ("tools/power turbostat: Add support for AMD
Fam 17h (Zen) RAPL") and the commit e64e869ce0d439da9fb2 ("tools/power
turbostat: Also read package power on AMD F17h (Zen)") add AMD Fam 17h
RAPL support.
Hygon Family 18h(Dhyana) support RAPL in bit 14 of CPUID 0x80000007 EDX,
and has MSRs RAPL_PWR_UNIT/CORE_ENERGY_STAT/PKG_ENERGY_STAT. So add Hygon
Dhyana Family 18h support for RAPL.
Already tested on Hygon multi-node systems and it shows correct per-core
energy usage and the total package power.
Signed-off-by: Pu Wen <puwen@hygon.cn> Reviewed-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Len Brown <len.brown@intel.com>
Pu Wen [Sat, 31 Aug 2019 02:19:58 +0000 (10:19 +0800)]
tools/power turbostat: Fix caller parameter of get_tdp_amd()
Commit 3975338c4deee573d0a2 ("tools/power turbostat: Add support for AMD
Fam 17h (Zen) RAPL") add a function get_tdp_amd(), the parameter is CPU
family. But the rapl_probe_amd() function use wrong model parameter.
Fix the wrong caller parameter of get_tdp_amd() to use family.
Cc: <stable@vger.kernel.org> # v5.1+ Signed-off-by: Pu Wen <puwen@hygon.cn> Reviewed-by: Calvin Walton <calvin.walton@kepstin.ca> Signed-off-by: Len Brown <len.brown@intel.com>
In some case C1% will be wrong value, when platform doesn't have MSR for
C1 residency.
For example:
Core CPU CPU%c1
- - 100.00
0 0 100.00
0 2 100.00
1 1 100.00
1 3 100.00
But adding Busy% will fix this
Core CPU Busy% CPU%c1
- - 99.77 0.23
0 0 99.77 0.23
0 2 99.77 0.23
1 1 99.77 0.23
1 3 99.77 0.23
This issue can be reproduced on most of the recent systems including
Broadwell, Skylake and later.
This is because if we don't select Busy% or Avg_MHz or Bzy_MHz then
mperf value will not be read from MSR, so it will be 0. But this
is required for C1% calculation when MSR for C1 residency is not present.
Same is true for C3, C6 and C7 column selection.
So add another define DO_BIC_READ(), which doesn't depend on user
column selection and use for mperf, C3, C6 and C7 related counters.
So when there is no platform support for C1 residency counters,
we still read these counters, if the CPU has support and user selected
display of CPU%c1.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
Artem Bityutskiy [Wed, 14 Aug 2019 17:12:56 +0000 (20:12 +0300)]
tools/power turbostat: do not enforce 1ms
Turbostat works by taking a snapshot of counters, sleeping, taking another
snapshot, calculating deltas, and printing out the table.
The sleep time is controlled via -i option or by user sending a signal or a
character to stdin. In the latter case, turbostat always adds 1 ms
sleep before it reads the counters, in order to avoid larger imprecisions
in the results in prints.
While the 1 ms delay may be a good idea for a "dumb" user, it is a
problem for an "aware" user. I do thousands and thousands of measurements
over a short period of time (like 2ms), and turbostat unconditionally adds
a 1ms to my interval, so I cannot get what I really need.
This patch removes the unconditional 1ms sleep. This is an expert user
tool, after all, and non-experts will unlikely ever use it in the non-fixed
interval mode anyway, so I think it is OK to remove the 1ms delay.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
Artem Bityutskiy [Wed, 14 Aug 2019 17:12:55 +0000 (20:12 +0300)]
tools/power turbostat: read from pipes too
Commit 'f87fc7a99897 tools/power turbostat: fix printing on input' make
a valid fix, but it completely disabled piped stdin support, which is
a valuable use-case. Indeed, if stdin is a pipe, turbostat won't read
anything from it, so it becomes impossible to get turbostat output at
user-defined moments, instead of the regular intervals.
There is no reason why this should works for terminals, but not for
pipes. This patch improves the situation. Instead of ignoring pipes, we
read data from them but gracefully handle the EOF case.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>
This enables turbostat utility on Ice Lake NNPI SoC.
Link: https://lkml.org/lkml/2019/6/5/1034 Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@linux.intel.com> Signed-off-by: Len Brown <len.brown@intel.com>