git.baikalelectronics.ru Git

mlxsw: spectrum_switchdev: Fix port_vlan refcounting

Switchdev notifications for addition of SWITCHDEV_OBJ_ID_PORT_VLAN are
distributed not only on clean addition, but also when flags on an
existing VLAN are changed. mlxsw_sp_bridge_port_vlan_add() calls
mlxsw_sp_port_vlan_get() to get at the port_vlan in question, which
implicitly references the object. This then leads to discrepancies in
reference counting when the VLAN is removed. spectrum.c warns about the
problem when the module is removed:

[13578.493090] WARNING: CPU: 0 PID: 2454 at drivers/net/ethernet/mellanox/mlxsw/spectrum.c:2973 mlxsw_sp_port_remove+0xfd/0x110 [mlxsw_spectrum]
[...]
[13578.627106] Call Trace:
[13578.629617]  mlxsw_sp_fini+0x2a/0xe0 [mlxsw_spectrum]
[13578.634748]  mlxsw_core_bus_device_unregister+0x3e/0x130 [mlxsw_core]
[13578.641290]  mlxsw_pci_remove+0x13/0x40 [mlxsw_pci]
[13578.646238]  pci_device_remove+0x31/0xb0
[13578.650244]  device_release_driver_internal+0x14f/0x220
[13578.655562]  driver_detach+0x32/0x70
[13578.659183]  bus_remove_driver+0x47/0xa0
[13578.663134]  pci_unregister_driver+0x1e/0x80
[13578.667486]  mlxsw_sp_module_exit+0xc/0x3fa [mlxsw_spectrum]
[13578.673207]  __x64_sys_delete_module+0x13b/0x1e0
[13578.677888]  ? exit_to_usermode_loop+0x78/0x80
[13578.682374]  do_syscall_64+0x39/0xe0
[13578.685976]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix by putting the port_vlan when mlxsw_sp_port_vlan_bridge_join()
determines it's a flag-only change.

Fixes: 77d86be8fe6f ("spectrum: Reference count VLAN entries")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Align with new route replace logic

Commit 2146678a953c ("net/ipv6: Simplify route replace and appending
into multipath route") changed the IPv6 route replace logic so that the
first matching route (i.e., same metric) is replaced.

Have mlxsw replace the first matching route as well.

Fixes: 2146678a953c ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_router: Allow appending to dev-only routes

Commit 2146678a953c ("net/ipv6: Simplify route replace and appending
into multipath route") changed the IPv6 route append logic so that
dev-only routes can be appended and not only gatewayed routes.

Align mlxsw with the new behaviour.

Fixes: 2146678a953c ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: Only emit append events for appended routes

Current code will emit an append event in the FIB notification chain for
any route added with NLM_F_APPEND set, even if the route was not
appended to any existing route.

This is inconsistent with IPv4 where such an event is only emitted when
the new route is appended after an existing one.

Align IPv6 behavior with IPv4, thereby allowing listeners to more easily
handle these events.

Fixes: 2146678a953c ("net/ipv6: Simplify route replace and appending into multipath route")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mac80211-for-davem-2018-06-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A handful of fixes:
* missing RCU grace period enforcement led to drivers freeing
   data structures before; fix from Dedy Lansky.
* hwsim module init error paths were messed up; fixed it myself
   after a report from Colin King (who had sent a partial patch)
* kernel-doc tag errors; fix from Luca Coelho
* initialize the on-stack sinfo data structure when getting
   station information; fix from Sven Eckelmann
* TXQ state dumping is now done from init, and when TXQs aren't
   initialized yet at that point, bad things happen, move the
   initialization; fix from Toke Høiland-Jørgensen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

stmmac: added support for 802.1ad vlan stripping

stmmac reception handler calls stmmac_rx_vlan() to strip the vlan before
calling napi_gro_receive().

The function assumes VLAN tagged frames are always tagged with
802.1Q protocol, and assigns ETH_P_8021Q to the skb by hard-coding
the parameter on call to __vlan_hwaccel_put_tag() .

This causes packets not to be passed to the VLAN slave if it was created
with 802.1AD protocol
(ip link add link eth0 eth0.100 type vlan proto 802.1ad id 100).

This fix passes the protocol from the VLAN header into
__vlan_hwaccel_put_tag() instead of using the hard-coded value of
ETH_P_8021Q.

NETIF_F_HW_VLAN_STAG_RX check was added and the strip action is now
dependent on the correct combination of features and the detected vlan tag.

NETIF_F_HW_VLAN_STAG_RX feature was added to be in line with the driver
actual abilities.

Signed-off-by: Elad Nachman <eladn@gilat.com>
Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>

cfg80211: fix rcu in cfg80211_unregister_wdev

Callers of cfg80211_unregister_wdev can free the wdev object
immediately after this function returns. This may crash the kernel
because this wdev object is still in use by other threads.
Add synchronize_rcu() after list_del_rcu to make sure wdev object can
be safely freed.

Signed-off-by: Dedy Lansky <dlansky@codeaurora.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

mac80211: Move up init of TXQs

On init, ieee80211_if_add() dumps the interface. Since that now includes a
dump of the TXQ state, we need to initialise that before the dump happens.
So move up the TXQ initialisation to to before the call to
ieee80211_if_add().

Fixes: cf2d7a84af4c ("cfg80211: Expose TXQ stats and parameters to userspace")
Reported-by: Niklas Cassel <niklas.cassel@linaro.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Tested-by: Niklas Cassel <niklas.cassel@linaro.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

mac80211_hwsim: fix module init error paths

We didn't free the workqueue on any errors, nor did we
correctly check for rhashtable allocation errors, nor
did we free the hashtable on error.

Reported-by: Colin King <colin.king@canonical.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

cfg80211: initialize sinfo in cfg80211_get_station

Most of the implementations behind cfg80211_get_station will not initialize
sinfo to zero before manipulating it. For example, the member "filled",
which indicates the filled in parts of this struct, is often only modified
by enabling certain bits in the bitfield while keeping the remaining bits
in their original state. A caller without a preinitialized sinfo.filled can
then no longer decide which parts of sinfo were filled in by
cfg80211_get_station (or actually the underlying implementations).

cfg80211_get_station must therefore take care that sinfo is initialized to
zero. Otherwise, the caller may tries to read information which was not
filled in and which must therefore also be considered uninitialized. In
batadv_v_elp_get_throughput's case, an invalid "random" expected throughput
may be stored for this neighbor and thus the B.A.T.M.A.N V algorithm may
switch to non-optimal neighbors for certain destinations.

Fixes: b5ed2f012f36 ("cfg80211: implement cfg80211_get_station cfg80211 API")
Reported-by: Thomas Lauer <holminateur@gmail.com>
Reported-by: Marcel Schmidt <ff.z-casparistrasse@mailbox.org>
Cc: b.a.t.m.a.n@lists.open-mesh.org
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

nl80211: fix some kernel doc tag mistakes

There is a bunch of tags marking constants with &, which means struct
or enum name. Replace them with %, which is the correct tag for
constants.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>

hv_netvsc: Fix the variable sizes in ipsecv2 and rsc offload

These fields in struct ndis_ipsecv2_offload and struct ndis_rsc_offload
are one byte according to the specs. This patch defines them with the
right size. These structs are not in use right now, but will be used soon.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rds: avoid unenecessary cong_update in loop transport

Loop transport which is self loopback, remote port congestion
update isn't relevant. Infact the xmit path already ignores it.
Receive path needs to do the same.

Reported-by: syzbot+4c20b3866171ce8441d2@syzkaller.appspotmail.com
Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'l2tp-fixes'

Guillaume Nault says:

====================
l2tp: pppol2tp_connect() fixes

This series fixes a few remaining issues with pppol2tp_connect().

It doesn't try to prevent invalid configurations that have no effect on
kernel's reliability. That would be work for a future patch set.

Patch 2 is the most important as it avoids an invalid pointer
dereference crashing the kernel. It depends on patch 1 for correctly
identifying L2TP session types.

Patches 3 and 4 avoid creating stale tunnels and sessions.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: clean up stale tunnel or session in pppol2tp_connect's error path

pppol2tp_connect() may create a tunnel or a session. Remove them in
case of error.

Fixes: b2079113577e ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: prevent pppol2tp_connect() from creating kernel sockets

If 'fd' is negative, l2tp_tunnel_create() creates a tunnel socket using
the configuration passed in 'tcfg'. Currently, pppol2tp_connect() sets
the relevant fields to zero, tricking l2tp_tunnel_create() into setting
up an unusable kernel socket.

We can't set 'tcfg' with the required fields because there's no way to
get them from the current connect() parameters. So let's restrict
kernel sockets creation to the netlink API, which is the original use
case.

Fixes: 7ecc1f0da88f ("l2tp: Add support for static unmanaged L2TPv3 tunnels")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: only accept PPP sessions in pppol2tp_connect()

l2tp_session_priv() returns a struct pppol2tp_session pointer only for
PPPoL2TP sessions. In particular, if the session is an L2TP_PWTYPE_ETH
pseudo-wire, l2tp_session_priv() returns a pointer to an l2tp_eth_sess
structure, which is much smaller than struct pppol2tp_session. This
leads to invalid memory dereference when trying to lock ps->sk_lock.

Fixes: 9911951a94eb ("l2tp: Add L2TP ethernet pseudowire support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

l2tp: fix pseudo-wire type for sessions created by pppol2tp_connect()

Define cfg.pw_type so that the new session is created with its .pwtype
field properly set (L2TP_PWTYPE_PPP).

Not setting the pseudo-wire type had several annoying effects:

  * Invalid value returned in the L2TP_ATTR_PW_TYPE attribute when
    dumping sessions with the netlink API.

  * Impossibility to delete the session using the netlink API (because
    l2tp_nl_cmd_session_delete() gets the deletion callback function
    from an array indexed by the session's pseudo-wire type).

Also, there are several cases where we should check a session's
pseudo-wire type. For example, pppol2tp_connect() should refuse to
connect a session that is not PPPoL2TP, but that requires the session's
.pwtype field to be properly set.

Fixes: 9b4d7c520ad6 ("l2tp: Add L2TPv3 protocol support")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'emaclite-fixes'

Radhey Shyam Pandey says:

====================
emaclite bug fixes and code cleanup

This patch series fixes bug in emaclite remove and mdio_setup routines.
It does minor code cleanup.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: emaclite: Remove xemaclite_mdio_setup return check

Errors are already reported in xemaclite_mdio_setup so avoid
reporting it again.

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: emaclite: Remove unused 'has_mdio' flag.

Remove unused 'has_mdio' flag.

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: emaclite: Fix MDIO bus unregister bug

Since 'has_mdio' flag is not used,sequence insmod->rmmod-> insmod
leads to failure as MDIO unregister doesn't happen in .remove().
Fix it by checking MII bus pointer instead.

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: emaclite: Fix position of lp->mii_bus assignment

To ensure MDIO bus is not double freed in remove() path
assign lp->mii_bus after MDIO bus registration.

Signed-off-by: Radhey Shyam Pandey <radhey.shyam.pandey@xilinx.com>
Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: verify the checksum of the first data segment in a new connection

commit abe93be0399c ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly. These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Fixes: abe93be0399c ("tcp/dccp: install syn_recv requests into ehash table")
Signed-off-by: Frank van der Linden <fllinden@amazon.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Balbir Singh <bsingharora@gmail.com>
Reviewed-by: Balbir Singh <bsingharora@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: qcom/emac: Add missing of_node_put()

Add missing of_node_put() call for device node returned by
of_parse_phandle().

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Timur Tabi <timur@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

sctp: define sctp_packet_gso_append to build GSO frames

Now sctp GSO uses skb_gro_receive() to append the data into head
skb frag_list. However it actually only needs very few code from
skb_gro_receive(). Besides, NAPI_GRO_CB has to be set while most
of its members are not needed here.

This patch is to add sctp_packet_gso_append() to build GSO frames
instead of skb_gro_receive(), and it would avoid many unnecessary
checks and make the code clearer.

Note that sctp will use page frags instead of frag_list to build
GSO frames in another patch. But it may take time, as sctp's GSO
frames may have different size. skb_segment() can only split it
into the frags with the same size, which would break the border
of sctp chunks.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter patches for your net tree:

1) Fix NULL pointer dereference from nf_nat_decode_session() if NAT is
   not loaded, from Prashant Bhole.

2) Fix socket extension module autoload.

3) Don't bogusly reject sets with the NFT_SET_EVAL flag set on from
   the dynset extension.

4) Fix races with nf_tables module removal and netns exit path,
   patches from Florian Westphal.

5) Don't hit BUG_ON if jumpstack goes too deep, instead hit
   WARN_ON_ONCE, from Taehee Yoo.

6) Another NULL pointer dereference from ctnetlink, again if NAT is
   not loaded, from Florian Westphal.

7) Fix x_tables match list corruption in xt_connmark module removal
   path, also from Florian.

8) nf_conncount doesn't properly deal with conntrack zones, hence
   garbage collector may get rid of entries in a different zone.
   From Yi-Hung Wei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

xen/netfront: raise max number of slots in xennet_get_responses()

The max number of slots used in xennet_get_responses() is set to
MAX_SKB_FRAGS + (rx->status <= RX_COPY_THRESHOLD).

In old kernel-xen MAX_SKB_FRAGS was 18, while nowadays it is 17. This
difference is resulting in frequent messages "too many slots" and a
reduced network throughput for some workloads (factor 10 below that of
a kernel-xen based guest).

Replacing MAX_SKB_FRAGS by XEN_NETIF_NR_SLOTS_MIN for calculation of
the max number of slots to use solves that problem (tests showed no
more messages "too many slots" and throughput was as high as with the
kernel-xen based guest system).

Replace MAX_SKB_FRAGS-2 by XEN_NETIF_NR_SLOTS_MIN-1 in
netfront_tx_slot_available() for making it clearer what is really being
tested without actually modifying the tested value.

Signed-off-by: Juergen Gross <jgross@suse.com>
Reviewed-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

smc: convert to ->poll_mask

smc->clcsock is an internal TCP socket, after TCP socket
converts to ->poll_mask, ->poll doesn't exist any more.
So just convert smc socket to ->poll_mask too.

Fixes: ffa4cbc0e784 ("net/tcp: convert to ->poll_mask")
Reported-by: syzbot+f5066e369b2d5fff630f@syzkaller.appspotmail.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: stmmac: dwmac-meson8b: Fix an error handling path in 'meson8b_dwmac_probe()'

If 'of_device_get_match_data()' fails, we need to release some resources as
done in the other error handling path of this function.

Fixes: aae8786d0b5a ("net: stmmac: dwmac-meson: extend phy mode setting")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>

tc-testing: ife: fix wrong teardown command in test b7b8

fix failures in the 'teardown' stage of test b7b8, probably a leftover of
commit 6473a871dc3e ("tc-testing: fixed copy-pasting error in ife tests")

Fixes: a18660a76cc6f ("tc-testing: updated ife test cases")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: thunderx: prevent concurrent data re-writing by nicvf_set_rx_mode

For each network interface linux network stack issue ndo_set_rx_mode call
in order to configure MAC address filters (e.g. for multicast filtering).
Currently ThunderX NICVF driver has only one ordered workqueue to process
such requests for all VFs.

And because of that it is possible that subsequent call to
ndo_set_rx_mode would corrupt data which is currently in use
by nicvf_set_rx_mode_task. Which in turn could cause following issue:
[...]
[   48.978341] Unable to handle kernel paging request at virtual address 1fffff0000000000
[   48.986275] Mem abort info:
[   48.989058]   Exception class = DABT (current EL), IL = 32 bits
[   48.994965]   SET = 0, FnV = 0
[   48.998020]   EA = 0, S1PTW = 0
[   49.001152] Data abort info:
[   49.004022]   ISV = 0, ISS = 0x00000004
[   49.007869]   CM = 0, WnR = 0
[   49.010826] [1fffff0000000000] address between user and kernel address ranges
[   49.017963] Internal error: Oops: 96000004 [#1] SMP
[...]
[   49.072138] task: ffff800fdd675400 task.stack: ffff000026440000
[   49.078051] PC is at prefetch_freepointer.isra.37+0x28/0x3c
[   49.083613] LR is at kmem_cache_alloc_trace+0xc8/0x1fc
[...]
[   49.272684] [<ffff0000082738f0>] prefetch_freepointer.isra.37+0x28/0x3c
[   49.279286] [<ffff000008276bc8>] kmem_cache_alloc_trace+0xc8/0x1fc
[   49.285455] [<ffff0000082c0c0c>] alloc_fdtable+0x78/0x134
[   49.290841] [<ffff0000082c15c0>] dup_fd+0x254/0x2f4
[   49.295709] [<ffff0000080d1954>] copy_process.isra.38.part.39+0x64c/0x1168
[   49.302572] [<ffff0000080d264c>] _do_fork+0xfc/0x3b0
[   49.307524] [<ffff0000080d29e8>] SyS_clone+0x44/0x50
[...]

This patch is to prevent such concurrent data write with spinlock.

Reported-by: Dean Nelson <dnelson@redhat.com>
Signed-off-by: Vadim Lomovtsev <Vadim.Lomovtsev@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: mdio-gpio: Cut surplus includes

The GPIO MDIO driver now needs only <linux/gpio/consumer.h>
so cut the legacy <linux/gpio.h> and <linux/of_gpio.h>
includes that are no longer used.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'hv_netvsc-notification-and-namespace-fixes'

Stephen Hemminger says:

====================
hv_netvsc: notification and namespace fixes

This set of patches addresses two set of fixes. First it backs out
the common callback model which was merged in net-next without
completing all the review feedback or getting maintainer approval.

Then it fixes the transparent VF management code to handle network
namespaces.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: move VF to same namespace as netvsc device

When VF is added, the paravirtual device is already present
and may have been moved to another network namespace. For example,
sometimes the management interface is put in another net namespace
in some environments.

The VF should get moved to where the netvsc device is when the
VF is discovered. The user can move it later (if desired).

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: fix network namespace issues with VF support

When finding the parent netvsc device, the search needs to be across
all netvsc device instances (independent of network namespace).

Find parent device of VF using upper_dev_get routine which
searches only adjacent list.

Fixes: ff172301a320 ("hv_netvsc: improve VF device matching")
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
netns aware byref
Signed-off-by: David S. Miller <davem@davemloft.net>

hv_netvsc: drop common code until callback model fixed

The callback model of handling network failover is not suitable
in the current form.
  1. It was merged without addressing all the review feedback.
  2. It was merged without approval of any of the netvsc maintainers.
  3. Design discussion on how to handle PV/VF fallback is still
     not complete.
  4. IMHO the code model using callbacks is trying to make
     something common which isn't.

Revert the netvsc specific changes for now. Does not impact ongoing
development of failover model for virtio.
Revisit this after a simpler library based failover kernel
routines are extracted.

This reverts
commit 49b06480a4c2 ("hv_netvsc: fix error return code in netvsc_probe()")
and
commit eefdd9082e70 ("netvsc: refactor notifier/event handling code to use the failover framework")

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'nfp-fixes'

Jakub Kicinski says:

====================
nfp: fix a warning, stats, naming and route leak

Various fixes for the NFP. Patch 1 fixes a harmless GCC 8 warning.
Patch 2 ensures statistics are correct after users decrease the number
of channels/rings. Patch 3 restores phy_port_name behaviour for flower,
ndo_get_phy_port_name used to return -EOPNOTSUPP on one of the netdevs,
and we need to keep it that way otherwise interface names may change.
Patch 4 fixes refcnt leak in flower tunnel offload code.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: flower: free dst_entry in route table

We need to release the refcnt on dst_entry in the route table, otherwise
we will leak the route.

Fixes: afd57c0543bb ("nfp: flower vxlan neighbour offload")
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: Louis Peens <louis.peens@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: remove phys_port_name on flower's vNIC

.ndo_get_phys_port_name was recently extended to support multi-vNIC
FWs.  These are firmwares which can have more than one vNIC per PF
without associated port (e.g. Adaptive Buffer Management FW), therefore
we need a way of distinguishing the vNICs.  Unfortunately, it's too
late to make flower use the same naming.  Flower users may depend on
.ndo_get_phys_port_name returning -EOPNOTSUPP, for example the name
udev gave the PF vNIC was just the bare PCI device-based name before
the change, and will have 'nn0' appended after.

To ensure flower's vNIC doesn't have phys_port_name attribute, add
a flag to vNIC struct and set it in flower code.  New projects will
not set the flag adhere to the naming scheme from the start.

Fixes: d17fe5019850 ("nfp: assign vNIC id as phys_port_name of vNICs which are not ports")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: include all ring counters in interface stats

We are gathering software statistics on per-ring basis.
.ndo_get_stats64 handler adds the rings up. Unfortunately
we are currently only adding up active rings, which means
that if user decreases the number of active rings the
statistics from deactivated rings will no longer be counted
and total interface statistics may go backwards.

Always sum all possible rings, the stats are allocated
statically for max number of rings, so we don't have to
worry about them being removed. We could add the stats
up when user changes the ring count, but it seems unnecessary..
Adding up inactive rings will be very quick since no datapath
will be touching them.

Fixes: b1f6575f948d ("nfp: add support for ethtool .set_channels")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: don't pad strings in nfp_cpp_resource_find() to avoid gcc 8 warning

Once upon a time nfp_cpp_resource_find() took a name parameter,
which could be any user-chosen string.  Resources are identified
by a CRC32 hash of a 8 byte string, so we had to pad user input
with zeros to make sure CRC32 gave the correct result.

Since then nfp_cpp_resource_find() was made to operate on allocated
resources only (struct nfp_resource).  We kzalloc those so there is
no need to pad the strings and use memcmp.

This avoids a GCC 8 stringop-truncation warning:

In function ‘nfp_cpp_resource_find’,
    inlined from ‘nfp_resource_try_acquire’ at .../nfpcore/nfp_resource.c:153:8,
    inlined from ‘nfp_resource_acquire’ at .../nfpcore/nfp_resource.c:206:9:
    .../nfpcore/nfp_resource.c:108:2: warning:  strncpy’ output may be truncated copying 8 bytes from a string of length 8 [-Wstringop-truncation]
      strncpy(name_pad, res->name, sizeof(name_pad));
      ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Revert "net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets"

Revert the patch mentioned in the subject because it breaks at least
the Avahi mDNS daemon. That patch namely causes the Ubuntu 18.04 Avahi
daemon to fail to start:

Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Successfully called chroot().
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Successfully dropped remaining capabilities.
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: No service file found in /etc/avahi/services.
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: SO_REUSEADDR failed: Structure needs cleaning
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: SO_REUSEADDR failed: Structure needs cleaning
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: Failed to create server: No suitable network protocol available
Jun 12 09:49:24 ubuntu-vm avahi-daemon[529]: avahi-daemon 0.7 exiting.
Jun 12 09:49:24 ubuntu-vm systemd[1]: avahi-daemon.service: Main process exited, code=exited, status=255/n/a
Jun 12 09:49:24 ubuntu-vm systemd[1]: avahi-daemon.service: Failed with result 'exit-code'.
Jun 12 09:49:24 ubuntu-vm systemd[1]: Failed to start Avahi mDNS/DNS-SD Stack.

Fixes: 22db4c55a1ff ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

netfilter: nf_conncount: Fix garbage collection with zones

Currently, we use check_hlist() for garbage colleciton. However, we
use the ‘zone’ from the counted entry to query the existence of
existing entries in the hlist. This could be wrong when they are in
different zones, and this patch fixes this issue.

Fixes: 02b9b7d706e5 ("netfilter: xt_connlimit: honor conntrack zone if available")
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: xt_connmark: fix list corruption on rmmod

This needs to use xt_unregister_targets, else new revision is left
on the list which then causes list to point to a target struct that has been free'd.

Fixes: 505bf7ba1ece ("netfilter: xt_conntrack: Support bit-shifting for CONNMARK & MARK targets.")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: ctnetlink: avoid null pointer dereference

Dan Carpenter points out that deref occurs after NULL check, we should
re-fetch the pointer and check that instead.

Fixes: b34fe2574e2a1 ("netfilter: add struct nf_nat_hook and use it")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: use WARN_ON_ONCE instead of BUG_ON in nft_do_chain()

When depth of chain is bigger than NFT_JUMP_STACK_SIZE, the nft_do_chain
crashes. But there is no need to crash hard here.

Suggested-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: close race between netns exit and rmmod

If net namespace is exiting while nf_tables module is being removed
we can oops:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: nf_tables_flowtable_event+0x43/0xf0 [nf_tables]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
Modules linked in: nf_tables(-) nfnetlink [..]
  unregister_netdevice_notifier+0xdd/0x130
  nf_tables_module_exit+0x24/0x3a [nf_tables]
  SyS_delete_module+0x1c5/0x240
  do_syscall_64+0x74/0x190

Avoid this by attempting to take reference on the net namespace from
the notifiers.  If it fails the namespace is exiting already, and nft
core is taking care of cleanup work.

We also need to make sure the netdev hook type gets removed
before netns ops removal, else notifier might be invoked with device
event for a netns where net->nft was never initialised (because
pernet ops was removed beforehand).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: fix module unload race

We must first remove the nfnetlink protocol handler when nf_tables module
is unloaded -- we don't want userspace to submit new change requests once
we've started to tear down nft state.

Furthermore, nfnetlink must not call any subsystem function after
call_batch returned -EAGAIN.

EAGAIN means the subsys mutex was dropped, so its unlikely but possible that
nf_tables subsystem was removed due to 'rmmod nf_tables' on another cpu.

Therefore, we must abort batch completely and not move on to next part of
the batch.

Last, we can't invoke ->abort unless we've checked that the subsystem is
still registered.

Change netns exit path of nf_tables to make sure any incompleted
transaction gets removed on exit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_dynset: do not reject set updates with NFT_SET_EVAL

NFT_SET_EVAL is signalling the kernel that this sets can be updated from
the evaluation path, even if there are no expressions attached to the
element. Otherwise, set updates with no expressions fail. Update
description to describe the right semantics.

Fixes: e907265f4d6d ("netfilter: nf_tables: add support for dynamic set updates")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_socket: fix module autoload

Add alias definition for module autoload when adding socket rules.

Fixes: 6076d560d83e ("netfilter: nf_tables: add support for native socket matching")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: fix null-ptr-deref in nf_nat_decode_session

Add null check for nat_hook in nf_nat_decode_session()

[  195.648098] UBSAN: Undefined behaviour in ./include/linux/netfilter.h:348:14
[  195.651366] BUG: KASAN: null-ptr-deref in __xfrm_policy_check+0x208/0x1d70
[  195.653888] member access within null pointer of type 'struct nf_nat_hook'
[  195.653896] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.17.0-rc6+ #5
[  195.656320] Read of size 8 at addr 0000000000000008 by task ping/2469
[  195.658715] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[  195.658721] Call Trace:
[  195.661087]
[  195.669341]  <IRQ>
[  195.670574]  dump_stack+0xc6/0x150
[  195.672156]  ? dump_stack_print_info.cold.0+0x1b/0x1b
[  195.674121]  ? ubsan_prologue+0x31/0x92
[  195.676546]  ubsan_epilogue+0x9/0x49
[  195.678159]  handle_null_ptr_deref+0x11a/0x130
[  195.679800]  ? sprint_OID+0x1a0/0x1a0
[  195.681322]  __ubsan_handle_type_mismatch_v1+0xd5/0x11d
[  195.683146]  ? ubsan_prologue+0x92/0x92
[  195.684642]  __xfrm_policy_check+0x18ef/0x1d70
[  195.686294]  ? rt_cache_valid+0x118/0x180
[  195.687804]  ? __xfrm_route_forward+0x410/0x410
[  195.689463]  ? fib_multipath_hash+0x700/0x700
[  195.691109]  ? kvm_sched_clock_read+0x23/0x40
[  195.692805]  ? pvclock_clocksource_read+0xf6/0x280
[  195.694409]  ? graph_lock+0xa0/0xa0
[  195.695824]  ? pvclock_clocksource_read+0xf6/0x280
[  195.697508]  ? pvclock_read_flags+0x80/0x80
[  195.698981]  ? kvm_sched_clock_read+0x23/0x40
[  195.700347]  ? sched_clock+0x5/0x10
[  195.701525]  ? sched_clock_cpu+0x18/0x1a0
[  195.702846]  tcp_v4_rcv+0x1d32/0x1de0
[  195.704115]  ? lock_repin_lock+0x70/0x270
[  195.707072]  ? pvclock_read_flags+0x80/0x80
[  195.709302]  ? tcp_v4_early_demux+0x4b0/0x4b0
[  195.711833]  ? lock_acquire+0x195/0x380
[  195.714222]  ? ip_local_deliver_finish+0xfc/0x770
[  195.716967]  ? raw_rcv+0x2b0/0x2b0
[  195.718856]  ? lock_release+0xa00/0xa00
[  195.720938]  ip_local_deliver_finish+0x1b9/0x770
[...]

Fixes: b34fe2574e2a ("netfilter: add struct nf_nat_hook and use it")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

tcp: Do not reload skb pointer after skb_gro_receive().

This is not necessary. skb_gro_receive() will never change what
'head' points to.

In it's original implementation (see commit 389c80170dbd ("net: Add
skb_gro_receive")), it did:

====================
+ *head = nskb;
+ nskb->next = p->next;
+ p->next = NULL;
====================

This sequence was removed in commit a52e86ba7e9c ("net: gro: remove
obsolete code from skb_gro_receive()")

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2018-06-12

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
   umem setup, from Björn.

2) Silence a warning in bpf fs when an application tries to open(2) a
   pinned bpf obj due to missing fops. Add a dummy open fop that continues
   to just bail out in such case, from Daniel.

3) Fix a BPF selftest urandom_read build issue where gcc complains that
   it gets built twice, from Anders.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch '10GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue

Jeff Kirsher says:

====================
Intel Wired LAN Driver Updates 2018-06-11

This series contains fixes to ixgbe IPsec and MACVLAN.

Alex provides the 5 fixes in this series, starting with fixing an issue
where num_rx_pools was not being populated until after the queues and
interrupts were reinitialized when enabling MACVLAN interfaces.  Updated
to use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM, since the code
requires CONFIG_XFRM_OFFLOAD to be enabled.  Moved the IPsec
initialization function to be more consistent with the placement of
similar initialization functions and before the call to reset the
hardware, which will clean up any link issues that may have been
introduced.  Fixed the boolean logic that was testing for transmit OR
receive ready bits, when it should have been testing for transmit AND
receive ready bits.  Fixed the bit definitions for SECTXSTAT and SECRXSTAT
registers and ensure that if IPsec is disabled on the part, do not
enable it.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net/ipv6: Ensure cfg is properly initialized in ipv6_create_tempaddr

Valdis reported a BUG in ipv6_add_addr:

[ 1820.832682] BUG: unable to handle kernel NULL pointer dereference at 0000000000000209
[ 1820.832728] RIP: 0010:ipv6_add_addr+0x280/0xd10
[ 1820.832732] Code: 49 8b 1f 0f 84 6a 0a 00 00 48 85 db 0f 84 4e 0a 00 00 48 8b 03 48 8b 53 08 49 89 45 00 49 8b 47 10
49 89 55 08 48 85 c0 74 15 <48> 8b 50 08 48 8b 00 49 89 95 b8 01 00 00 49 89 85 b0 01 00 00 4c
[ 1820.832847] RSP: 0018:ffffaa07c2fd7880 EFLAGS: 00010202
[ 1820.832853] RAX: 0000000000000201 RBX: ffffaa07c2fd79b0 RCX: 0000000000000000
[ 1820.832858] RDX: a4cfbfba2cbfa64c RSI: 0000000000000000 RDI: ffffffff8a8e9fa0
[ 1820.832862] RBP: ffffaa07c2fd7920 R08: 000000000000017a R09: ffffffff8a555300
[ 1820.832866] R10: 0000000000000000 R11: 0000000000000000 R12: ffff888d18e71c00
[ 1820.832871] R13: ffff888d0a9b1200 R14: 0000000000000000 R15: ffffaa07c2fd7980
[ 1820.832876] FS:  00007faa51bdb800(0000) GS:ffff888d1d400000(0000) knlGS:0000000000000000
[ 1820.832880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1820.832885] CR2: 0000000000000209 CR3: 000000021e8f8001 CR4: 00000000001606e0
[ 1820.832888] Call Trace:
[ 1820.832898]  ? __local_bh_enable_ip+0x119/0x260
[ 1820.832904]  ? ipv6_create_tempaddr+0x259/0x5a0
[ 1820.832912]  ? __local_bh_enable_ip+0x139/0x260
[ 1820.832921]  ipv6_create_tempaddr+0x2da/0x5a0
[ 1820.832926]  ? ipv6_create_tempaddr+0x2da/0x5a0
[ 1820.832941]  manage_tempaddrs+0x1a5/0x240
[ 1820.832951]  inet6_addr_del+0x20b/0x3b0
[ 1820.832959]  ? nla_parse+0xce/0x1e0
[ 1820.832968]  inet6_rtm_deladdr+0xd9/0x210
[ 1820.832981]  rtnetlink_rcv_msg+0x1d4/0x5f0

Looking at the code I found 1 element (peer_pfx) of the newly introduced
ifa6_config struct that is not initialized. Use a memset rather than hard
coding an init for each struct element.

Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Fixes: 95a8590585ea3 ("net/ipv6: Convert ipv6_add_addr to struct ifa6_config")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

tls: fix NULL pointer dereference on poll

While hacking on kTLS, I ran into the following panic from an
unprivileged netserver / netperf TCP session:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
  PGD 800000037f378067 P4D 800000037f378067 PUD 3c0e61067 PMD 0
  Oops: 0010 [#1] SMP KASAN PTI
  CPU: 1 PID: 2289 Comm: netserver Not tainted 4.17.0+ #139
  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
  RIP: 0010:          (null)
  Code: Bad RIP value.
  RSP: 0018:ffff88036abcf740 EFLAGS: 00010246
  RAX: dffffc0000000000 RBX: ffff88036f5f6800 RCX: 1ffff1006debed26
  RDX: ffff88036abcf920 RSI: ffff8803cb1a4f00 RDI: ffff8803c258c280
  RBP: ffff8803c258c280 R08: ffff8803c258c280 R09: ffffed006f559d48
  R10: ffff88037aacea43 R11: ffffed006f559d49 R12: ffff8803c258c280
  R13: ffff8803cb1a4f20 R14: 00000000000000db R15: ffffffffc168a350
  FS:  00007f7e631f4700(0000) GS:ffff8803d1c80000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: ffffffffffffffd6 CR3: 00000003ccf64005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   ? tls_sw_poll+0xa4/0x160 [tls]
   ? sock_poll+0x20a/0x680
   ? do_select+0x77b/0x11a0
   ? poll_schedule_timeout.constprop.12+0x130/0x130
   ? pick_link+0xb00/0xb00
   ? read_word_at_a_time+0x13/0x20
   ? vfs_poll+0x270/0x270
   ? deref_stack_reg+0xad/0xe0
   ? __read_once_size_nocheck.constprop.6+0x10/0x10
  [...]

Debugging further, it turns out that calling into ctx->sk_poll() is
invalid since sk_poll itself is NULL which was saved from the original
TCP socket in order for tls_sw_poll() to invoke it.

Looks like the recent conversion from poll to poll_mask callback started
in 745ac7be7fa1 ("net: add support for ->poll_mask in proto_ops") missed
to eventually convert kTLS, too: TCP's ->poll was converted over to the
->poll_mask in commit ffa4cbc0e784 ("net/tcp: convert to ->poll_mask")
and therefore kTLS wrongly saved the ->poll old one which is now NULL.

Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
tcp_poll_mask() as well that is mangled here.

Fixes: ffa4cbc0e784 ("net/tcp: convert to ->poll_mask")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Dave Watson <davejwatson@fb.com>
Tested-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

xsk: silence warning on memory allocation failure

syzkaller reported a warning from xdp_umem_pin_pages():

  WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
  ...
  __do_kmalloc mm/slab.c:3713 [inline]
  __kmalloc+0x25/0x760 mm/slab.c:3727
  kmalloc_array include/linux/slab.h:634 [inline]
  kcalloc include/linux/slab.h:645 [inline]
  xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
  xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
  xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
  xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
  __do_sys_setsockopt net/socket.c:1946 [inline]
  __se_sys_setsockopt net/socket.c:1943 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is a warning about attempting to allocate more than
KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
the request is too big, the kernel is free to deny its allocation. In
this patch, the failed allocation attempt is silenced with
__GFP_NOWARN.

Fixes: 8b9341b67841 ("xsk: add user memory registration support sockopt")
Reported-by: syzbot+4abadc5d69117b346506@syzkaller.appspotmail.com
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree:

1) Reject non-null terminated helper names from xt_CT, from Gao Feng.

2) Fix KASAN splat due to out-of-bound access from commit phase, from
   Alexey Kodanev.

3) Missing conntrack hook registration on IPVS FTP helper, from Julian
   Anastasov.

4) Incorrect skbuff allocation size in bridge nft_reject, from Taehee Yoo.

5) Fix inverted check on packet xmit to non-local addresses, also from
   Julian.

6) Fix ebtables alignment compat problems, from Alin Nastac.

7) Hook mask checks are not correct in xt_set, from Serhey Popovych.

8) Fix timeout listing of element in ipsets, from Jozsef.

9) Cap maximum timeout value in ipset, also from Jozsef.

10) Don't allow family option for hash:mac sets, from Florent Fourcot.

11) Restrict ebtables to work with NFPROTO_BRIDGE targets only, this
    Florian.

12) Another bug reported by KASAN in the rbtree set backend, from
    Taehee Yoo.

13) Missing __IPS_MAX_BIT update doesn't include IPS_OFFLOAD_BIT.
    From Gao Feng.

14) Missing initialization of match/target in ebtables, from Florian
    Westphal.

15) Remove useless nft_dup.h file in include path, from C. Labbe.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: add error handling for pskb_trim_rcsum

When pskb_trim_rcsum fails, the lack of error-handling code may
cause unexpected results.

This patch adds error-handling code after calling pskb_trim_rcsum.

Signed-off-by: Zhouyang Jia <jiazhouyang09@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ipv6: allow PMTU exceptions to local routes

IPVS setups with local client and remote tunnel server need
to create exception for the local virtual IP. What we do is to
change PMTU from 64KB (on "lo") to 1460 in the common case.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Fixes: 0c8dfdb319ff ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Fixes: c9859188b5c8 ("ipv6: Don't create clones of host routes.")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ixgbe: Fix bit definitions and add support for testing for ipsec support

This patch addresses two issues. First it adds the correct bit definitions
for the SECTXSTAT and SECRXSTAT registers. Then it makes use of those
definitions to test for if IPsec has been disabled on the part and if so we
do not enable it.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Reported-by: Andre Tomt <andre@tomt.net>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Avoid loopback and fix boolean logic in ipsec_stop_data

This patch fixes two issues. First we add an early test for the Tx and Rx
security block ready bits. By doing this we can avoid the need for waits or
loopback in the event that the security block is already flushed out.
Secondly we fix the boolean logic that was testing for the Tx OR Rx ready
bits being set and change it so that we only exit if the Tx AND Rx ready
bits are both set.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Move ipsec init function to before reset call

This patch moves the IPsec init function in ixgbe_sw_init. This way it is a
bit more consistent with the placement of similar initialization functions
and is placed before the reset_hw call which should allow us to clean up
any link issues that may be introduced by the fact that we force the link
up if somehow the device had IPsec still enabled before the driver was
loaded.

In addition to the function move it is necessary to change the assignment
of netdev->features. The easiest way to do this is to just test for the
existence of adapter->ipsec and if it is present we set the feature bits.

Fixes: be03b4f4e2b5 ("ixgbe: add ipsec engine start and stop routines")
Reported-by: Andre Tomt <andre@tomt.net>
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM

There is no point in adding code if CONFIG_XFRM is defined that we won't
use unless CONFIG_XFRM_OFFLOAD is defined. So instead of leaving this code
floating around I am replacing the ifdef with what I believe is the correct
one so that we only include the code and variables if they will actually be
used.

Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Acked-by: Shannon Nelson <shannon.nelson@oracle.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

ixgbe: Fix setting of TC configuration for macvlan case

When we were enabling macvlan interfaces we weren't correctly configuring
things until ixgbe_setup_tc was called a second time either by tweaking the
number of queues or increasing the macvlan count past 15.

The issue came down to the fact that num_rx_pools is not populated until
after the queues and interrupts are reinitialized.

Instead of trying to set it sooner we can just move the call to setup at
least 1 traffic class to the SR-IOV/VMDq setup function so that we just set
it for this one case. We already had a spot that was configuring the queues
for TC 0 in the code here anyway so it makes sense to also set the number
of TCs here as well.

Fixes: 4b4c7f1569bd ("ixgbe: Fix handling of macvlan Tx offload")
Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>

selftests: bpf: fix urandom_read build issue

gcc complains that urandom_read gets built twice.

gcc -o tools/testing/selftests/bpf/urandom_read
-static urandom_read.c -Wl,--build-id
gcc -Wall -O2 -I../../../include/uapi -I../../../lib -I../../../lib/bpf
-I../../../../include/generated -I../../../include urandom_read.c
urandom_read -lcap -lelf -lrt -lpthread -o
tools/testing/selftests/bpf/urandom_read
gcc: fatal error: input file
‘tools/testing/selftests/bpf/urandom_read’ is the
same as output file
compilation terminated.
../lib.mk:110: recipe for target
'tools/testing/selftests/bpf/urandom_read' failed
To fix this issue remove the urandom_read target and so target
TEST_CUSTOM_PROGS gets used.

Fixes: cb1fff697234 ("bpf: add selftest for stackmap with BPF_F_STACK_BUILD_ID")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

1) Fix several bpfilter/UMH bugs, in particular make the UMH build not
    depend upon X86 specific Kconfig symbols. From Alexei Starovoitov.

2) Fix handling of modified context pointer in bpf verifier, from
    Daniel Borkmann.

3) Kill regression in ifdown/ifup sequences for hv_netvsc driver, from
    Dexuan Cui.

4) When the bonding primary member name changes, we have to re-evaluate
    the bond->force_primary setting, from Xiangning Yu.

5) Eliminate possible padding beyone end of SKB in cdc_ncm driver, from
    Bjørn Mork.

6) RX queue length reported for UDP sockets in procfs and socket diag
    are inaccurate, from Paolo Abeni.

7) Fix br_fdb_find_port() locking, from Petr Machata.

8) Limit sk_rcvlowat values properly in TCP, from Soheil Hassas
    Yeganeh.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits)
  tcp: limit sk_rcvlowat by the maximum receive buffer
  net: phy: dp83822: use BMCR_ANENABLE instead of BMSR_ANEGCAPABLE for DP83620
  socket: close race condition between sock_close() and sockfs_setattr()
  net: bridge: Fix locking in br_fdb_find_port()
  udp: fix rx queue len reported by diag and proc interface
  cdc_ncm: avoid padding beyond end of skb
  net/sched: act_simple: fix parsing of TCA_DEF_DATA
  net: fddi: fix a possible null-ptr-deref
  net: aquantia: fix unsigned numvecs comparison with less than zero
  net: stmmac: fix build failure due to missing COMMON_CLK dependency
  bpfilter: fix race in pipe access
  bpf, xdp: fix crash in xdp_umem_unaccount_pages
  xsk: Fix umem fill/completion queue mmap on 32-bit
  tools/bpf: fix selftest get_cgroup_id_user
  bpfilter: fix OUTPUT_FORMAT
  umh: fix race condition
  net: mscc: ocelot: Fix uninitialized error in ocelot_netdevice_event()
  bonding: re-evaluate force_primary when the primary slave name changes
  ip_tunnel: Fix name string concatenate in __ip_tunnel_create()
  hv_netvsc: Fix a network regression after ifdown/ifup
  ...

Merge tag 'rtc-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

Pull RTC updates from Alexandre Belloni:
"Setting the supported range from drivers for RTCs failing soon has
  started. A few fixes are developed along the way. Some drivers have
  been switched to SPDX by their maintainers.

  Subsystem:

   - rework of the rtc-test driver which allows to test the core more
     thoroughly

   - rtc_set_alarm() now fails early when alarms are not supported

  Drivers:

   - mktime() is now replaced by mktime64()

   - RTC range added for 88pm80x, ab-b5ze-s3, at91rm9200,
     brcmstb-waketimer, ds1685, ftrtc010, ls1x, mxc_v2, rx8581, sprd,
     st-lpc, tps6586x, tps65910 and vr41xx

   - fixed a possible race condition in probe functions

   - pxa: fix the probe function that is broken since v4.3

   - stm32: now supports stm32mp1"

* tag 'rtc-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (78 commits)
  rtc: pxa: fix probe function
  rtc: cros-ec: Switch to SPDX identifier.
  rtc: cros-ec: Make license text and module license match.
  rtc: ensure rtc_set_alarm fails when alarms are not supported
  rtc: test: remove alarm support from the first device
  rtc: test: convert to devm_rtc_allocate_device
  rtc: ftrtc010: let the core handle range
  rtc: ftrtc010: handle dates after 2106
  rtc: ftrtc010: switch to devm_rtc_allocate_device
  rtc: mrst: switch to devm functions
  rtc: sunxi: fix possible race condition
  rtc: test: remove irq sysfs file
  rtc: test: emulate alarms using timers
  rtc: test: store time as an offset to system time
  rtc: test: allow registering many devices
  rtc: test: remove useless proc info
  rtc: ds1685: Add range
  rtc: ds1685: fix possible race condition
  rtc: sprd: Add new RTC power down check method
  rtc: sun6i: Fix bit_idx value for clk_register_gate
  ...

Merge tag 'upstream-4.18-rc1' of git://git.infradead.org/linux-ubifs

Pull UBI and UBIFS updates from Richard Weinberger:

- the UBI on-disk format header file is now dual licensed

- new way to detect Fastmap problems during runtime

- bugfix for Fastmap

- minor updates for UBIFS (spelling, comments, vm_fault_t, ...)

* tag 'upstream-4.18-rc1' of git://git.infradead.org/linux-ubifs:
  mtd: ubi: Update ubi-media.h to dual license
  ubi: fastmap: Detect EBA mismatches on-the-fly
  ubi: fastmap: Check each mapping only once
  ubi: fastmap: Correctly handle interrupted erasures in EBA
  ubi: fastmap: Cancel work upon detach
  ubifs: lpt: Fix wrong pnode number range in comment
  ubifs: gc: Fix typo
  ubifs: log: Some spelling fixes
  ubifs: Spelling fix someting -> something
  ubifs: journal: Remove wrong comment
  ubifs: remove set but never used variable
  ubifs, xattr: remove misguided quota flags
  fs: ubifs: Adding new return type vm_fault_t

tcp: limit sk_rcvlowat by the maximum receive buffer

The user-provided value to setsockopt(SO_RCVLOWAT) can be
larger than the maximum possible receive buffer. Such values
mute POLLIN signals on the socket which can stall progress
on the socket.

Limit the user-provided value to half of the maximum receive
buffer, i.e., half of sk_rcvbuf when the receive buffer size
is set by the user, or otherwise half of sysctl_tcp_rmem[2].

Fixes: 96ba1ba9c27a ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

Pull SCSI updates from James Bottomley:
"This is mostly updates to the usual drivers: ufs, qedf, mpt3sas, lpfc,
  xfcp, hisi_sas, cxlflash, qla2xxx.

  In the absence of Nic, we're also taking target updates which are
  mostly minor except for the tcmu refactor.

  The only real core change to worry about is the removal of high page
  bouncing (in sas, storvsc and iscsi). This has been well tested and no
  problems have shown up so far"

* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (268 commits)
  scsi: lpfc: update driver version to 12.0.0.4
  scsi: lpfc: Fix port initialization failure.
  scsi: lpfc: Fix 16gb hbas failing cq create.
  scsi: lpfc: Fix crash in blk_mq layer when executing modprobe -r lpfc
  scsi: lpfc: correct oversubscription of nvme io requests for an adapter
  scsi: lpfc: Fix MDS diagnostics failure (Rx < Tx)
  scsi: hisi_sas: Mark PHY as in reset for nexus reset
  scsi: hisi_sas: Fix return value when get_free_slot() failed
  scsi: hisi_sas: Terminate STP reject quickly for v2 hw
  scsi: hisi_sas: Add v2 hw force PHY function for internal ATA command
  scsi: hisi_sas: Include TMF elements in struct hisi_sas_slot
  scsi: hisi_sas: Try wait commands before before controller reset
  scsi: hisi_sas: Init disks after controller reset
  scsi: hisi_sas: Create a scsi_host_template per HW module
  scsi: hisi_sas: Reset disks when discovered
  scsi: hisi_sas: Add LED feature for v3 hw
  scsi: hisi_sas: Change common allocation mode of device id
  scsi: hisi_sas: change slot index allocation mode
  scsi: hisi_sas: Introduce hisi_sas_phy_set_linkrate()
  scsi: hisi_sas: fix a typo in hisi_sas_task_prep()
  ...

net: phy: dp83822: use BMCR_ANENABLE instead of BMSR_ANEGCAPABLE for DP83620

DP83620 register set is compatible with the DP83848, but it also supports
100base-FX. When the hardware is configured such as that fiber mode is
enabled, autonegotiation is not possible.

The chip, however, doesn't expose this information via BMSR_ANEGCAPABLE.
Instead, this bit is always set high, even if the particular hardware
configuration makes it so that auto negotiation is not possible [1]. Under
these circumstances, the phy subsystem keeps trying for autonegotiation to
happen, without success.

Hereby, we inspect BMCR_ANENABLE bit after genphy_config_init, which on
reset is set to 0 when auto negotiation is disabled, and so we use this
value instead of BMSR_ANEGCAPABLE.

[1] https://e2e.ti.com/support/interface/ethernet/f/903/p/697165/2571170

Signed-off-by: Alvaro Gamez Machado <alvaro.gamez@hazent.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

socket: close race condition between sock_close() and sockfs_setattr()

fchownat() doesn't even hold refcnt of fd until it figures out
fd is really needed (otherwise is ignored) and releases it after
it resolves the path. This means sock_close() could race with
sockfs_setattr(), which leads to a NULL pointer dereference
since typically we set sock->sk to NULL in ->release().

As pointed out by Al, this is unique to sockfs. So we can fix this
in socket layer by acquiring inode_lock in sock_close() and
checking against NULL in sockfs_setattr().

sock_release() is called in many places, only the sock_close()
path matters here. And fortunately, this should not affect normal
sock_close() as it is only called when the last fd refcnt is gone.
It only affects sock_close() with a parallel sockfs_setattr() in
progress, which is not common.

Fixes: d999f2405cd7 ("net: core: Add a UID field to struct sock.")
Reported-by: shankarapailoor <shankarapailoor@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag '4.18-fixes-smb3' of git://git.samba.org/sfrench/cifs-2.6

Pull cifs fixes from Steve French:

- one smb3 (ACL related) fix for stable

- one SMB3 security enhancement (when mounting -t smb3 forbid less
   secure dialects)

- some RDMA and compounding fixes

* tag '4.18-fixes-smb3' of git://git.samba.org/sfrench/cifs-2.6:
  cifs: fix a buffer leak in smb2_query_symlink
  smb3: do not allow insecure cifs mounts when using smb3
  CIFS: Fix NULL ptr deref
  CIFS: fix encryption in SMB3.1.1
  CIFS: Pass page offset for encrypting
  CIFS: Pass page offset for calculating signature
  CIFS: SMBD: Support page offset in memory registration
  CIFS: SMBD: Support page offset in RDMA recv
  CIFS: SMBD: Support page offset in RDMA send
  CIFS: When sending data on socket, pass the correct page offset
  CIFS: Introduce helper function to get page offset and length in smb_rqst
  CIFS: Calculate the correct request length based on page offset and tail size
  cifs: For SMB2 security informaion query, check for minimum sized security descriptor instead of sizeof FileAllInformation class
  CIFS: Fix signing for SMB2/3

Merge tag 'for-linus-20180610' of git://git.kernel.dk/linux-block

Pull block flush handling fix from Jens Axboe:
"Single fix that we should merge now, fixing a regression in queuing
  flush request, accessing request flags after calling the end_request
  handler"

* tag 'for-linus-20180610' of git://git.kernel.dk/linux-block:
  block: fix use-after-free in block flush handling

Merge branch 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull restartable sequence support from Thomas Gleixner:
"The restartable sequences syscall (finally):

  After a lot of back and forth discussion and massive delays caused by
  the speculative distraction of maintainers, the core set of
  restartable sequences has finally reached a consensus.

  It comes with the basic non disputed core implementation along with
  support for arm, powerpc and x86 and a full set of selftests

  It was exposed to linux-next earlier this week, so it does not fully
  comply with the merge window requirements, but there is really no
  point to drag it out for yet another cycle"

* 'core-rseq-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  rseq/selftests: Provide Makefile, scripts, gitignore
  rseq/selftests: Provide parametrized tests
  rseq/selftests: Provide basic percpu ops test
  rseq/selftests: Provide basic test
  rseq/selftests: Provide rseq library
  selftests/lib.mk: Introduce OVERRIDE_TARGETS
  powerpc: Wire up restartable sequences system call
  powerpc: Add syscall detection for restartable sequences
  powerpc: Add support for restartable sequences
  x86: Wire up restartable sequence system call
  x86: Add support for restartable sequences
  arm: Wire up restartable sequences system call
  arm: Add syscall detection for restartable sequences
  arm: Add restartable sequences support
  rseq: Introduce restartable sequences system call
  uapi/headers: Provide types_32_64.h

Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 updates and fixes from Thomas Gleixner:

- Fix the (late) fallout from the vector management rework causing
   hlist corruption and irq descriptor reference leaks caused by a
   missing sanity check.

   The straight forward fix triggered another long standing issue to
   surface. The pre rework code hid the issue due to being way slower,
   but now the chance that user space sees an EBUSY error return when
   updating irq affinities is way higher, though quite a bunch of
   userspace tools do not handle it properly despite the fact that EBUSY
   could be returned for at least 10 years.

   It turned out that the EBUSY return can be avoided completely by
   utilizing the existing delayed affinity update mechanism for irq
   remapped scenarios as well. That's a bit more error handling in the
   kernel, but avoids fruitless fingerpointing discussions with tool
   developers.

- Decouple PHYSICAL_MASK from AMD SME as its going to be required for
   the upcoming Intel memory encryption support as well.

- Handle legacy device ACPI detection properly for newer platforms

- Fix the wrong argument ordering in the vector allocation tracepoint

- Simplify the IDT setup code for the APIC=n case

- Use the proper string helpers in the MTRR code

- Remove a stale unused VDSO source file

- Convert the microcode update lock to a raw spinlock as its used in
   atomic context.

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/intel_rdt: Enable CMT and MBM on new Skylake stepping
  x86/apic/vector: Print APIC control bits in debugfs
  genirq/affinity: Defer affinity setting if irq chip is busy
  x86/platform/uv: Use apic_ack_irq()
  x86/ioapic: Use apic_ack_irq()
  irq_remapping: Use apic_ack_irq()
  x86/apic: Provide apic_ack_irq()
  genirq/migration: Avoid out of line call if pending is not set
  genirq/generic_pending: Do not lose pending affinity update
  x86/apic/vector: Prevent hlist corruption and leaks
  x86/vector: Fix the args of vector_alloc tracepoint
  x86/idt: Simplify the idt_setup_apic_and_irq_gates()
  x86/platform/uv: Remove extra parentheses
  x86/mm: Decouple dynamic __PHYSICAL_MASK from AMD SME
  x86: Mark native_set_p4d() as __always_inline
  x86/microcode: Make the late update update_lock a raw lock for RT
  x86/mtrr: Convert to use strncpy_from_user() helper
  x86/mtrr: Convert to use match_string() helper
  x86/vdso: Remove unused file
  x86/i8237: Register device based on FADT legacy boot flag

Merge branch 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull x86 pti updates from Thomas Gleixner:
"Three small commits updating the SSB mitigation to take the updated
  AMD mitigation variants into account"

* 'x86-pti-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/bugs: Switch the selection of mitigation from CPU vendor to CPU features
  x86/bugs: Add AMD's SPEC_CTRL MSR usage
  x86/bugs: Add AMD's variant of SSB_NO

Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull more perf tooling updates from Thomas Gleixner:
"Perf tool updates and fixes:

  perf stat:

   - Display user and system time for workload targets (Jiri Olsa)

  perf record:

   - Enable arbitrary event names thru name= modifier (Alexey Budankov)

  PowerPC:

   - Add a python script for hypervisor call statistics (Ravi Bangoria)

  Intel PT: (Adrian Hunter)

   - Fix sync_switch INTEL_PT_SS_NOT_TRACING

   - Fix decoding to accept CBR between FUP and corresponding TIP

   - Fix MTC timing after overflow

   - Fix "Unexpected indirect branch" error

  perf test:

   - record+probe_libc_inet_pton:
      - To get the symbol table for dynamic shared objects on ubuntu we
        need to pass the -D/--dynamic command line option, unlike with
        the fedora distros (Arnaldo Carvalho de Melo)

   - code-reading:
      - Fix perf_env setup for PTI entry trampolines (Adrian Hunter)

   - kmod-path:
      - Add tests for vdso32 and vdsox32 (Adrian Hunter)

   - Use header file util/debug.h (Thomas Richter)

  perf annotate:

   - Make the various UI backends (stdio, TUI, gtk) use more
     consistently structs with annotation options as specified by the
     user (Arnaldo Carvalho de Melo)

   - Move annotation specific knobs from the symbol_conf global kitchen
     sink to the annotation option structs (Arnaldo Carvalho de Melo)

  perf script:

   - Add more PMU fields to python scripts event handler dict (Jin Yao)

  Core:

   - Fix misleading error for some unparsable events mentioning PMUs
     when those are not involved in the problem (Jiri Olsa)

   - Consider BSS symbols when processing /proc/kallsyms ('B' and 'b')
     (Arnaldo Carvalho de Melo)

   - Be more robust when trying to use per-symbol histograms, checking
     for unlikely but possible cases where the space for the histograms
     wasn't allocated, print a debug message for such cases (Arnaldo
     Carvalho de Melo)

   - Fix symbol and object code resolution for vdso32 and vdsox32
     (Adrian Hunter)

   - No need to check for null when passing pointers to foo__get() style
     refcount grabbing helpers, just like in the kernel and with free(),
     its safe to pass a NULL pointer to avoid having to check it before
     each and every foo__get() call (Arnaldo Carvalho de Melo)

   - Remove some dead code (quote.[ch]) (Arnaldo Carvalho de Melo)

   - Remove some needless globals, making them local (Arnaldo Carvalho
     de Melo)

   - Reduce usage of symbol_conf.use_callchain, using other means of
     finding out if callchains are in use or available for specific
     events, as we evolved this codebase to allow requesting callchains
     for just a subset of the monitored events. In time it will help
     polish recording and showing mixed sets accross the various tools:

        perf record -e cycles/call-graph=fp/,cache-misses/call-graph=dwarf/,instructions'

     (Arnaldo Carvalho de Melo)

   - Consider PTI entry trampolines in map__rip_2objdump() (Adrian
     Hunter)"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (50 commits)
  perf script python: Add dict fields introduction to Documentation
  perf script python: Add more PMU fields to event handler dict
  perf script python: Move dsoname code to a new function
  perf symbols: Add BSS symbols when reading from /proc/kallsyms
  perf annnotate: Make __symbol__inc_addr_samples handle src->histograms == NULL
  perf intel-pt: Fix "Unexpected indirect branch" error
  perf intel-pt: Fix MTC timing after overflow
  perf intel-pt: Fix decoding to accept CBR between FUP and corresponding TIP
  perf intel-pt: Fix sync_switch INTEL_PT_SS_NOT_TRACING
  perf script powerpc: Python script for hypervisor call statistics
  perf test record+probe_libc_inet_pton: Ask 'nm' for dynamic symbols
  perf map: Consider PTI entry trampolines in rip_2objdump()
  perf test code-reading: Fix perf_env setup for PTI entry trampolines
  perf tools: Fix pmu events parsing rule
  perf stat: Display user and system time
  perf record: Enable arbitrary event names thru name= modifier
  perf tools: Fix symbol and object code resolution for vdso32 and vdsox32
  perf tests kmod-path: Add tests for vdso32 and vdsox32
  perf hists: Check if a hist_entry has callchains before using them
  perf hists: Introduce hist_entry__has_callchain() method
  ...

Merge branch 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull irq fixes from Thomas Gleixner:
"Two small fixlets:

   - Add the missing iomu mapping call in the Freescale/NXP/Qualcomm/
     whoever owns it now/ SCFG MSI irqchip driver. Otherwise IRQs wont
     work at all.

   - Fix a SMP=n build warning in the STM32 irq chip driver"

* 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  irqchip/ls-scfg-msi: Map MSIs in the iommu
  irqchip/stm32: Fix non-SMP build warning

Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull core fixes from Thomas Gleixner:
"A small set of core updates:

   - Make objtool cope with GCC8 oddities some more

   - Remove a stale local_irq_save/restore sequence in the signal code
     along with the stale comment in the RCU code. The underlying issue
     which led to this has been solved long time ago, but nobody cared
     to cleanup the hackarounds"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  signal: Remove no longer required irqsave/restore
  rcu: Update documentation of rcu_read_unlock()
  objtool: Fix GCC 8 cold subfunction detection for aliased functions

signal: Remove no longer required irqsave/restore

Commit 2775f97369c3 ("signal: align __lock_task_sighand() irq disabling and
RCU") introduced a rcu read side critical section with interrupts
disabled. The changelog suggested that a better long-term fix would be "to
make rt_mutex_unlock() disable irqs when acquiring the rt_mutex structure's
->wait_lock".

This long-term fix has been made in commit cabffac91fbb ("rtmutex: Make
wait_lock irq safe") for a different reason.

Therefore revert commit 2775f97369c3 ("signal: align >
__lock_task_sighand() irq disabling and RCU") as the interrupt disable
dance is not longer required.

The change was tested on the base of cabffac91fbb ("rtmutex: Make wait_lock
irq safe") with a four hour run of rcutorture scenario TREE03 with lockdep
enabled as suggested by Paul McKenney.

Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: bigeasy@linutronix.de
Link: https://lkml.kernel.org/r/20180525090507.22248-3-anna-maria@linutronix.de

rcu: Update documentation of rcu_read_unlock()

Since commit cabffac91fbb ("rtmutex: Make wait_lock irq safe") the
explanation in rcu_read_unlock() documentation about irq unsafe rtmutex
wait_lock is no longer valid.

Remove it to prevent kernel developers reading the documentation to rely on
it.

Suggested-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: bigeasy@linutronix.de
Link: https://lkml.kernel.org/r/20180525090507.22248-2-anna-maria@linutronix.de

Merge branch 'proc-cmdline'

Merge proc_cmdline simplifications.

This re-writes the get_mm_cmdline() logic to be rather simpler than it
used to be, and makes the semantics for "cmdline goes past the end of
the original area" more natural.

You _can_ use prctl(PR_SET_MM) to just point your command line somewhere
else entirely, but the traditional model is to just edit things in place
and that still needs to continue to work.  At least this way the code
makes some sense.

* proc-cmdline:
  fs/proc: simplify and clarify get_mm_cmdline() function
  fs/proc: re-factor proc_pid_cmdline_read() a bit

hpfs: Use EUCLEAN for filesystem errors

Use the error code EUCLEAN for filesystem errors because other
filesystems use this code too.

[ And remove unused EMEMERROR - Linus ]

Signed-off-by: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Merge tag 'for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply

Pull power supply and reset updates from Sebastian Reichel:
- bq27xxx: Add BQ27426 support
- ab8500: Drop AB8540/9540 support
- Introduced new usb_type property
- Properly document the power-supply ABI
- misc. cleanups and fixes

* tag 'for-v4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply:
  MAINTAINERS: add entry for LEGO MINDSTORMS EV3
  power: supply: ab8500_charger: fix spelling mistake: "faile" -> "failed"
  power: supply: axp288_fuel_gauge: Remove polling from the driver
  power: supply: axp288_fuelguage: Do not bind when the fg function is not used
  power: supply: axp288_charger: Do not bind when the charge function is not used
  power: supply: axp288_charger: Support 3500 and 4000 mA input current limit
  power: supply: s3c-adc-battery: fix driver data initialization
  power: supply: charger-manager: Verify polling interval only when polling requested
  power: supply: sysfs: Use enum to specify property
  power: supply: ab8500: Drop AB8540/9540 support
  power: supply: ab8500_fg: fix spelling mistake: "Disharge" -> "Discharge"
  power: supply: simplify getting .drvdata
  power: supply: bq27xxx: Add support for BQ27426
  gpio-poweroff: Use gpiod_set_value_cansleep

Merge tag 'hsi-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi

Pull HSI update from Sebastian Reichel:
"Just one patch for the HSI subsystem this time: use the new vm_fault_t
return type"

* tag 'hsi-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-hsi:
hsi: clients: Change return type to vm_fault_t

Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux

Pull clk updates from Stephen Boyd:
"This time we have a good set of changes to the core framework that do
  some general cleanups, but nothing too major. The majority of the diff
  goes to two SoCs, Actions Semi and Qualcomm. A brand new driver is
  introduced for Actions Semi so it takes up some lines to add all the
  different types, and the Qualcomm diff is there because we add support
  for two SoCs and it's quite a bit of data.

  Otherwise the big driver updates are on TI Davinci and Amlogic
  platforms. And then the long tail of driver updates for various fixes
  and stuff follows after that.

  Core:
   - debugfs cleanups removing error checking and an unused provider API
   - Removal of a clk init typedef that isn't used
   - Usage of match_string() to simplify parent string name matching
   - OF clk helpers moved to their own file (linux/of_clk.h)
   - Make clk warnings more readable across kernel versions

  New Drivers:
   - Qualcomm SDM845 GCC and Video clk controllers
   - Qualcomm MSM8998 GCC
   - Actions Semi S900 SoC support
   - Nuvoton npcm750 microcontroller clks
   - Amlogic axg AO clock controller

  Removed Drivers:
   - Deprecated Rockchip clk-gate driver

  Updates:
   - debugfs functions stopped checking return values
   - Support for the MSIOF module clocks on Rensas R-Car M3-N
   - Support for the new Rensas RZ/G1C and R-Car E3 SoCs
   - Qualcomm GDSC, RCG, and PLL updates for clk changes in new SoCs
   - Berlin and Amlogic SPDX tagging
   - Usage of of_clk_get_parent_count() in more places
   - Proper implementation of the CDEV1/2 clocks on Tegra20
   - Allwinner H6 PRCM clock support and R40 EMAC support
   - Add critical flag to meson8b's fdiv2 as temporary fixup for ethernet
   - Round closest support for meson's mpll driver
   - Support for meson8b nand clocks and gxbb video decoder clocks
   - Mediatek mali clks
   - STM32MP1 fixes
   - Uniphier LD11/LD20 stream demux system clock"

* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (134 commits)
  clk: qcom: Export clk_fabia_pll_configure()
  clk: bcm: Update and add Stingray clock entries
  dt-bindings: clk: Update Stingray binding doc
  clk-si544: Properly round requested frequency to nearest match
  clk: ingenic: jz4770: Add 150us delay after enabling VPU clock
  clk: ingenic: jz4770: Enable power of AHB1 bus after ungating VPU clock
  clk: ingenic: jz4770: Modify C1CLK clock to disable CPU clock stop on idle
  clk: ingenic: jz4770: Change OTG from custom to standard gated clock
  clk: ingenic: Support specifying "wait for clock stable" delay
  clk: ingenic: Add support for clocks whose gate bit is inverted
  clk: use match_string() helper
  clk: bcm2835: use match_string() helper
  clk: Return void from debug_init op
  clk: remove clk_debugfs_add_file()
  clk: tegra: no need to check return value of debugfs_create functions
  clk: davinci: no need to check return value of debugfs_create functions
  clk: bcm2835: no need to check return value of debugfs_create functions
  clk: no need to check return value of debugfs_create functions
  clk: imx6: add EPIT clock support
  clk: mvebu: use correct bit for 98DX3236 NAND
  ...

Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md

Pull MD updates from Shaohua Li:
"A few fixes of MD for this merge window. Mostly bug fixes:

   - raid5 stripe batch fix from Amy

   - Read error handling for raid1 FailFast device from Gioh

   - raid10 recovery NULL pointer dereference fix from Guoqing

   - Support write hint for raid5 stripe cache from Mariusz

   - Fixes for device hot add/remove from Neil and Yufen

   - Improve flush bio scalability from Xiao"

* 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
  MD: fix lock contention for flush bios
  md/raid5: Assigning NULL to sh->batch_head before testing bit R5_Overlap of a stripe
  md/raid1: add error handling of read error from FailFast device
  md: fix NULL dereference of mddev->pers in remove_and_add_spares()
  raid5: copy write hint from origin bio to stripe
  md: fix two problems with setting the "re-add" device state.
  raid10: check bio in r10buf_pool_free to void NULL pointer dereference
  md: fix an error code format and remove unsed bio_sector

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc

Pull sparc updates from David Miller:

- a FPE signal fix that was also merged upstream

- privileged ADI driver from Tom Hromatka

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc:
  sparc: fix compat siginfo ABI regression
  selftests: sparc64: char: Selftest for privileged ADI driver
  char: sparc64: Add privileged ADI driver

Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide

Pull IDE updates from David Miller:
"Primarily IRQ disabling avoidance changes from Sebastian Andrzej
  Siewior"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/ide:
  ide: don't enable/disable interrupts in force threaded-IRQ mode
  ide: don't disable interrupts during kmap_atomic()
  ide: Handle irq disabling consistently
  alim15x3: move irq-restore before pci_dev_put()

Merge tag 'staging-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging

Pull staging/IIO updates from Greg KH:
"Here is the big staging and IIO driver update for 4.18-rc1.

  It was delayed as I wanted to make sure the final driver deletions did
  not cause any major merge issues, and all now looks good.

  There are a lot of patches here, just over 1000. The diffstat summary
  shows the major changes here:

1007 files changed, 16828 insertions(+), 227770 deletions(-)

  Because of this, we might be close to shrinking the overall kernel
  source code size for two releases in a row.

  There was loads of work in this release cycle, primarily:

   - tons of ks7010 driver cleanups

   - lots of mt7621 driver fixes and cleanups

   - most driver cleanups

   - wilc1000 fixes and cleanups

   - lots and lots of IIO driver cleanups and new additions

   - debugfs cleanups for all staging drivers

   - lots of other staging driver cleanups and fixes, the shortlog has
     the full details.

  but the big user-visable things here are the removal of 3 chunks of
  code:

   - ncpfs and ipx were removed on schedule, no one has cared about this
     code since it moved to staging last year, and if it needs to come
     back, it can be reverted.

   - lustre file system is removed.

     I've ranted at the lustre developers about once a year for the past
     5 years, with no real forward progress at all to clean things up
     and get the code into the "real" part of the kernel.

     Given that the lustre developers continue to work on an external
     tree and try to port those changes to the in-kernel tree every once
     in a while, this whole thing really really is not working out at
     all. So I'm deleting it so that the developers can spend the time
     working in their out-of-tree location and get things cleaned up
     properly to get merged into the tree correctly at a later date.

  Because of these file removals, you will have merge issues on some of
  these files (2 in the ipx code, 1 in the ncpfs code, and 1 in the
  atomisp driver). Just delete those files, it's a simple merge :)

  All of this has been in linux-next for a while with no reported
  problems"

* tag 'staging-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (1011 commits)
  staging: ipx: delete it from the tree
  ncpfs: remove uapi .h files
  ncpfs: remove Documentation
  ncpfs: remove compat functionality
  staging: ncpfs: delete it
  staging: lustre: delete the filesystem from the tree.
  staging: vc04_services: no need to save the log debufs dentries
  staging: vc04_services: vchiq_debugfs_log_entry can be a void *
  staging: vc04_services: remove struct vchiq_debugfs_info
  staging: vc04_services: move client dbg directory into static variable
  staging: vc04_services: remove odd vchiq_debugfs_top() wrapper
  staging: vc04_services: no need to check debugfs return values
  staging: mt7621-gpio: reorder includes alphabetically
  staging: mt7621-gpio: change gc_map to don't use pointers
  staging: mt7621-gpio: use GPIOF_DIR_OUT and GPIOF_DIR_IN macros instead of custom values
  staging: mt7621-gpio: change 'to_mediatek_gpio' to make just a one line return
  staging: mt7621-gpio: dt-bindings: update documentation for #interrupt-cells property
  staging: mt7621-gpio: update #interrupt-cells for the gpio node
  staging: mt7621-gpio: dt-bindings: complete documentation for the gpio
  staging: mt7621-dts: add missing properties to gpio node
  ...

x86/intel_rdt: Enable CMT and MBM on new Skylake stepping

New stepping of Skylake has fixes for cache occupancy and memory
bandwidth monitoring.

Update the code to enable these by default on newer steppings.

Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: stable@vger.kernel.org # v4.14
Cc: Vikas Shivappa <vikas.shivappa@linux.intel.com>
Link: https://lkml.kernel.org/r/20180608160732.9842-1-tony.luck@intel.com

block: fix use-after-free in block flush handling

A recent commit reused the original request flags for the flush
queue handling. However, for some of the kick flush cases, the
original request was already completed. This caused a use after
free, if blk-mq wasn't used.

Fixes: 8bcc4b76e82e ("block: pass failfast and driver-specific flags to flush requests")
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm

Pull libnvdimm updates from Dan Williams:
"This adds a user for the new 'bytes-remaining' updates to
  memcpy_mcsafe() that you already received through Ingo via the
  x86-dax- for-linus pull.

  Not included here, but still targeting this cycle, is support for
  handling memory media errors (poison) consumed via userspace dax
  mappings.

  Summary:

   - DAX broke a fundamental assumption of truncate of file mapped
     pages. The truncate path assumed that it is safe to disconnect a
     pinned page from a file and let the filesystem reclaim the physical
     block. With DAX the page is equivalent to the filesystem block.
     Introduce dax_layout_busy_page() to enable filesystems to wait for
     pinned DAX pages to be released. Without this wait a filesystem
     could allocate blocks under active device-DMA to a new file.

   - DAX arranges for the block layer to be bypassed and uses
     dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
     However, the memcpy_mcsafe() facility is available through the pmem
     block driver. In order to safely handle media errors, via the DAX
     block-layer bypass, introduce copy_to_iter_mcsafe().

   - Fix cache management policy relative to the ACPI NFIT Platform
     Capabilities Structure to properly elide cache flushes when they
     are not necessary. The table indicates whether CPU caches are
     power-fail protected. Clarify that a deep flush is always performed
     on REQ_{FUA,PREFLUSH} requests"

* tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
  dax: Use dax_write_cache* helpers
  libnvdimm, pmem: Do not flush power-fail protected CPU caches
  libnvdimm, pmem: Unconditionally deep flush on *sync
  libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
  acpi, nfit: Remove ecc_unit_size
  dax: dax_insert_mapping_entry always succeeds
  libnvdimm, e820: Register all pmem resources
  libnvdimm: Debug probe times
  linvdimm, pmem: Preserve read-only setting for pmem devices
  x86, nfit_test: Add unit test for memcpy_mcsafe()
  pmem: Switch to copy_to_iter_mcsafe()
  dax: Report bytes remaining in dax_iomap_actor()
  dax: Introduce a ->copy_to_iter dax operation
  uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
  xfs, dax: introduce xfs_break_dax_layouts()
  xfs: prepare xfs_break_layouts() for another layout type
  xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
  mm, fs, dax: handle layout changes to pinned dax mappings
  mm: fix __gup_device_huge vs unmap
  mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
  ...

net: bridge: Fix locking in br_fdb_find_port()

Callers of br_fdb_find() need to hold the hash lock, which
br_fdb_find_port() doesn't do. However, since br_fdb_find_port() is not
doing any actual FDB manipulation, the hash lock is not really needed at
all. So convert to br_fdb_find_rcu(), surrounded by rcu_read_lock() /
_unlock() pair.

The device pointer copied from inside the FDB entry is then kept alive
by the RTNL lock, which br_fdb_find_port() asserts.

Fixes: 3a6127e211c2 ("net: bridge: Publish bridge accessor functions")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

udp: fix rx queue len reported by diag and proc interface

After commit c93c563246ef ("udp: add batching to udp_rmem_release()")
the sk_rmem_alloc field does not measure exactly anymore the
receive queue length, because we batch the rmem release. The issue
is really apparent only after commit 2924ededf5ae ("udp: do rmem bulk
free even if the rx sk queue is empty"): the user space can easily
check for an empty socket with not-0 queue length reported by the 'ss'
tool or the procfs interface.

We need to use a custom UDP helper to report the correct queue length,
taking into account the forward allocation deficit.

Reported-by: trevor.francis@46labs.com
Fixes: c93c563246ef ("UDP: add batching to udp_rmem_release()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cdc_ncm: avoid padding beyond end of skb

Commit fac67fb48c0c ("cdc_ncm: Add support for moving NDP to end
of NCM frame") added logic to reserve space for the NDP at the
end of the NTB/skb.  This reservation did not take the final
alignment of the NDP into account, causing us to reserve too
little space. Additionally the padding prior to NDP addition did
not ensure there was enough space for the NDP.

The NTB/skb with the NDP appended would then exceed the configured
max size. This caused the final padding of the NTB to use a
negative count, padding to almost INT_MAX, and resulting in:

[60103.825970] BUG: unable to handle kernel paging request at ffff9641f2004000
[60103.825998] IP: __memset+0x24/0x30
[60103.826001] PGD a6a06067 P4D a6a06067 PUD 4f65a063 PMD 72003063 PTE 0
[60103.826013] Oops: 0002 [#1] SMP NOPTI
[60103.826018] Modules linked in: (removed(
[60103.826158] CPU: 0 PID: 5990 Comm: Chrome_DevTools Tainted: G           O 4.14.0-3-amd64 #1 Debian 4.14.17-1
[60103.826162] Hardware name: LENOVO 20081 BIOS 41CN28WW(V2.04) 05/03/2012
[60103.826166] task: ffff964193484fc0 task.stack: ffffb2890137c000
[60103.826171] RIP: 0010:__memset+0x24/0x30
[60103.826174] RSP: 0000:ffff964316c03b68 EFLAGS: 00010216
[60103.826178] RAX: 0000000000000000 RBX: 00000000fffffffd RCX: 000000001ffa5000
[60103.826181] RDX: 0000000000000005 RSI: 0000000000000000 RDI: ffff9641f2003ffc
[60103.826184] RBP: ffff964192f6c800 R08: 00000000304d434e R09: ffff9641f1d2c004
[60103.826187] R10: 0000000000000002 R11: 00000000000005ae R12: ffff9642e6957a80
[60103.826190] R13: ffff964282ff2ee8 R14: 000000000000000d R15: ffff9642e4843900
[60103.826194] FS:  00007f395aaf6700(0000) GS:ffff964316c00000(0000) knlGS:0000000000000000
[60103.826197] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[60103.826200] CR2: ffff9641f2004000 CR3: 0000000013b0c000 CR4: 00000000000006f0
[60103.826204] Call Trace:
[60103.826212]  <IRQ>
[60103.826225]  cdc_ncm_fill_tx_frame+0x5e3/0x740 [cdc_ncm]
[60103.826236]  cdc_ncm_tx_fixup+0x57/0x70 [cdc_ncm]
[60103.826246]  usbnet_start_xmit+0x5d/0x710 [usbnet]
[60103.826254]  ? netif_skb_features+0x119/0x250
[60103.826259]  dev_hard_start_xmit+0xa1/0x200
[60103.826267]  sch_direct_xmit+0xf2/0x1b0
[60103.826273]  __dev_queue_xmit+0x5e3/0x7c0
[60103.826280]  ? ip_finish_output2+0x263/0x3c0
[60103.826284]  ip_finish_output2+0x263/0x3c0
[60103.826289]  ? ip_output+0x6c/0xe0
[60103.826293]  ip_output+0x6c/0xe0
[60103.826298]  ? ip_forward_options+0x1a0/0x1a0
[60103.826303]  tcp_transmit_skb+0x516/0x9b0
[60103.826309]  tcp_write_xmit+0x1aa/0xee0
[60103.826313]  ? sch_direct_xmit+0x71/0x1b0
[60103.826318]  tcp_tasklet_func+0x177/0x180
[60103.826325]  tasklet_action+0x5f/0x110
[60103.826332]  __do_softirq+0xde/0x2b3
[60103.826337]  irq_exit+0xae/0xb0
[60103.826342]  do_IRQ+0x81/0xd0
[60103.826347]  common_interrupt+0x98/0x98
[60103.826351]  </IRQ>
[60103.826355] RIP: 0033:0x7f397bdf2282
[60103.826358] RSP: 002b:00007f395aaf57d8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff6e
[60103.826362] RAX: 0000000000000000 RBX: 00002f07bc6d0900 RCX: 00007f39752d7fe7
[60103.826365] RDX: 0000000000000022 RSI: 0000000000000147 RDI: 00002f07baea02c0
[60103.826368] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[60103.826371] R10: 00000000ffffffff R11: 0000000000000000 R12: 00002f07baea02c0
[60103.826373] R13: 00002f07bba227a0 R14: 00002f07bc6d090c R15: 0000000000000000
[60103.826377] Code: 90 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83
e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48
ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1
[60103.826442] RIP: __memset+0x24/0x30 RSP: ffff964316c03b68
[60103.826444] CR2: ffff9641f2004000

Commit 4e081cabbf8d ("net: cdc_ncm: Reduce memory use when kernel
memory low") made this bug much more likely to trigger by reducing
the NTB size under memory pressure.

Link: https://bugs.debian.org/893393
Reported-by: Горбешко Богдан <bodqhrohro@gmail.com>
Reported-and-tested-by: Dennis Wassenberg <dennis.wassenberg@secunet.com>
Cc: Enrico Mioso <mrkiko.rs@gmail.com>
Fixes: fac67fb48c0c ("cdc_ncm: Add support for moving NDP to end of NCM frame")
Signed-off-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/sched: act_simple: fix parsing of TCA_DEF_DATA

use nla_strlcpy() to avoid copying data beyond the length of TCA_DEF_DATA
netlink attribute, in case it is less than SIMP_MAX_DATA and it does not
end with '\0' character.

v2: fix errors in the commit message, thanks Hangbin Liu

Fixes: de4be0a909ae ("net_cls_act: Make act_simple use of netlink policy.")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>