git.baikalelectronics.ru Git

ipv6: lift copy_from_user out of ipv6_route_ioctl

Prepare for better compat ioctl handling by moving the user copy out
of ipv6_route_ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: simplify phy_link_change arguments

This function was introduced to allow for different handling of
link up and link down events particularly with regard to the
netif_carrier. The third argument do_carrier allowed the flag to
be left unchanged.

Since then the phylink has introduced an implementation that
completely ignores the third parameter since it never wants to
change the flag and the phylib always sets the third parameter
to true so the flag is always changed.

Therefore the third argument (i.e. do_carrier) is no longer
necessary and can be removed. This also means that the phylib
phy_link_down() function no longer needs its second argument.

Signed-off-by: Doug Berger <opendmb@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

rds: convert get_user_pages() --> pin_user_pages()

This code was using get_user_pages_fast(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages_fast() + put_page() calls to
pin_user_pages_fast() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/

Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mptcp-do-not-block-on-subflow-socket'

Florian Westphal says:

====================
mptcp: do not block on subflow socket

This series reworks mptcp_sendmsg logic to avoid blocking on the subflow
socket.

It does so by removing the wait loop from mptcp_sendmsg_frag helper.

In order to do that, it moves prerequisites that are currently
handled in mptcp_sendmsg_frag (and cause it to wait until they are
met, e.g. frag cache refill) into the callers.

The worker can just reschedule in case no subflow socket is ready,
since it can't wait -- doing so would block other work items and
doesn't make sense anyway because we should not (re)send data
in case resources are already low.

The sendmsg path can use the existing wait logic until memory
becomes available.

Because large send requests can result in multiple mptcp_sendmsg_frag
calls from mptcp_sendmsg, we may need to restart the socket lookup in
case subflow can't accept more data or memory is low.

Doing so blocks on the mptcp socket, and existing wait handling
releases the msk lock while blocking.

Lastly, no need to use GFP_ATOMIC for extension allocation:
extend __skb_ext_alloc with gfp_t arg instead of hard-coded ATOMIC and
then relax the allocation constraints for mptcp case: those requests
occur in process context.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: allow __skb_ext_alloc to sleep

mptcp calls this from the transmit side, from process context.
Allow a sleeping allocation instead of unconditional GFP_ATOMIC.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: remove inner wait loop from mptcp_sendmsg_frag

previous patches made sure we only call into this function
when these prerequisites are met, so no need to wait on the
subflow socket anymore.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/7
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: fill skb page frag cache outside of mptcp_sendmsg_frag

The mptcp_sendmsg_frag helper contains a loop that will wait on the
subflow sk.

It seems preferrable to only wait in mptcp_sendmsg() when blocking io is
requested. mptcp_sendmsg already has such a wait loop that is used when
no subflow socket is available for transmission.

This is another preparation patch that makes sure we call
mptcp_sendmsg_frag only if the page frag cache has been refilled.

Followup patch will remove the wait loop from mptcp_sendmsg_frag().

The retransmit worker doesn't need to do this refill as it won't
transmit new mptcp-level data.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: fill skb extension cache outside of mptcp_sendmsg_frag

The mptcp_sendmsg_frag helper contains a loop that will wait on the
subflow sk.

It seems preferrable to only wait in mptcp_sendmsg() when blocking io is
requested. mptcp_sendmsg already has such a wait loop that is used when
no subflow socket is available for transmission.

This is a preparation patch that makes sure we call
mptcp_sendmsg_frag only if a skb extension has been allocated.

Moreover, such allocation currently uses GFP_ATOMIC while it
could use sleeping allocation instead.

Followup patches will remove the wait loop from mptcp_sendmsg_frag()
and will allow to do a sleeping allocation for the extension.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: avoid blocking in tcp_sendpages

The transmit loop continues to xmit new data until an error is returned
or all data was transmitted.

For the blocking i/o case, this means that tcp_sendpages() may block on
the subflow until more space becomes available, i.e. we end up sleeping
with the mptcp socket lock held.

Instead we should check if a different subflow is ready to be used.

This restarts the subflow sk lookup when the tx operation succeeded
and the tcp subflow can't accept more data or if tcp_sendpages
indicates -EAGAIN on a blocking mptcp socket.

In that case we also need to set the NOSPACE bit to make sure we get
notified once memory becomes available.

In case all subflows are busy, the existing logic will wait until a
subflow is ready, releasing the mptcp socket lock while doing so.

The mptcp worker already sets DONTWAIT, so no need to make changes there.

v2:
* set NOSPACE bit
* add a comment to clarify that mptcp-sk sndbuf limits need to
be checked as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: break and restart in case mptcp sndbuf is full

Its not enough to check for available tcp send space.

We also hold on to transmitted data for mptcp-level retransmits.
Right now we will send more and more data if the peer can ack data
at the tcp level fast enough, since that frees up tcp send buffer space.

But we also need to check that data was acked and reclaimed at the mptcp
level.

Therefore add needed check in mptcp_sendmsg, flush tcp data and
wait until more mptcp snd space becomes available if we are over the
limit.  Before we wait for more data, also make sure we start the
retransmit timer if we ran out of sndbuf space.

Otherwise there is a very small chance that we wait forever:

* receiver is waiting for data
* sender is blocked because mptcp socket buffer is full
* at tcp level, all data was acked
* mptcp-level snd_una was not updated, because last ack
   that acknowledged the last data packet carried an older
   MPTCP-ack.

Restarting the retransmit timer avoids this problem: if TCP
subflow is idle, data is retransmitted from the RTX queue.

New data will make the peer send a new, updated MPTCP-Ack.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: move common nospace-pattern to a helper

Paolo noticed that ssk_check_wmem() has same pattern, so add/use
common helper for both places.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: Drop 'pref medium' in route checks

The 'pref medium' attribute was moved in iproute2 to be near the prefix
which is where it applies versus after the last nexthop. The nexthop
tests were updated to drop the string from route checking, but it crept
in again with the compat tests.

Fixes: fd1e40ea17f1 ("selftests: net: add new testcases for nexthop API compat mode sysctl")
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-ipa-sc7180-suspend-resume'

Alex Elder says:

====================
net: ipa: sc7180 suspend/resume

This series permits suspend/resume to work for the IPA driver
on the Qualcomm SC7180 SoC. The IPA version on this SoC requires
interrupts to be enabled when the suspend and resume callbacks are
made, and the first patch moves away from using the noirq variants.
The second patch fixes a problem with resume that occurs because
pending interrupts were being cleared before starting a channel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: do not clear interrupt in gsi_channel_start()

In gsi_channel_start() there is harmless-looking comment "Clear the
channel's event ring interrupt in case it's pending". The intent
was to avoid getting spurious interrupts when first bringing up a
channel.

However we now use channel stop/start to implement suspend and
resume, and an interrupt pending at the time we resume is actually
something we don't want to ignore.

The very first time we bring up the channel we do not expect an
interrupt to be pending, and even if it were, the effect would
simply be to schedule NAPI on that channel, which would find nothing
to do, which is not a problem.

Stop clearing any pending IEOB interrupt in gsi_channel_start().
That leaves one caller of the trivial function gsi_isr_ieob_clear().
Get rid of that function and just open-code it in gsi_isr_ieob()
instead.

This fixes a problem where suspend/resume IPA v4.2 would get stuck
when resuming after a suspend.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: ipa: don't use noirq suspend/resume callbacks

Use the suspend and resume callbacks rather than suspend_noirq and
resume_noirq. With IPA v4.2, we use the CHANNEL_STOP command to
implement a suspend, and without interrupts enabled, that command
won't complete.

Signed-off-by: Alex Elder <elder@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-Reorganize-trap-data'

Ido Schimmel says:

====================
mlxsw: Reorganize trap data

This patch set does not include any functional changes. It merely
reworks the internal storage of traps, trap groups and trap policers in
mlxsw to each use a single array.

These changes allow us to get rid of the multiple arrays we currently
have for traps, which make the trap data easier to validate and extend
with more per-trap information in the future. It will also allow us to
more easily add per-ASIC traps in future submissions.

Last two patches include minor changes to devlink-trap selftests.

Tested with existing devlink-trap selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Do not hard code trap group name

It can be derived dynamically from the trap's name, so drop it.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: devlink_lib: Remove double blank line

One blank line is enough.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Store all trap data in one array

Each trap registered with devlink is mapped to one or more Rx listeners.
These listeners allow the switch driver (e.g., mlxsw_spectrum) to
register a function that is called when a packet is received (trapped)
for a specific reason.

Currently, three arrays are used to describe the mapping between the
logical devlink traps and the Rx listeners.

Instead, get rid of these arrays and store all the information in one
array that is easier to validate and extend with more per-trap
information.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Store all trap group data in one array

Use one array to store all the information about all the trap groups
instead of hard coding it in code. This will be used in future patches
to disable certain functionality (e.g., policer binding) on a trap group
basis.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Store all trap policer data in one array

Instead of maintaining an array of policers and a linked list, only
maintain an array.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mlxsw: spectrum_trap: Move struct definition out of header file

'struct mlxsw_sp_trap_policer_item' is only used in one file, so move it
there.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

r8169: remove remaining call to mdiobus_unregister

After having switched to devm_mdiobus_register() also this remaining
call to mdiobus_unregister() can be removed.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ethtool-set_channels-add-a-few-more-checks'

Jakub Kicinski says:

====================
ethtool: set_channels: add a few more checks

There seems to be a few more things we can check in the core before
we call drivers' ethtool_ops->set_channels. Adding the checks to
the core simplifies the drivers. This set only includes changes
to the NFP driver as an example.

There is a small risk in the first patch that someone actually
purposefully accepts a strange configuration without RX or TX
channels, but I couldn't find such a driver in the tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: don't call set_channels in drivers if config didn't change

Don't call drivers if nothing changed. Netlink code already
contains this logic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

nfp: don't check lack of RX/TX channels

Core will now perform this check.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

ethtool: check if there is at least one channel for TX/RX in the core

Having a channel config with no ability to RX or TX traffic is
clearly wrong. Check for this in the core so the drivers don't
have to.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: Use 32-bit DATA_ACK when possible

RFC8684 allows to send 32-bit DATA_ACKs as long as the peer is not
sending 64-bit data-sequence numbers. The 64-bit DSN is only there for
extreme scenarios when a very high throughput subflow is combined with a
long-RTT subflow such that the high-throughput subflow wraps around the
32-bit sequence number space within an RTT of the high-RTT subflow.

It is thus a rare scenario and we should try to use the 32-bit DATA_ACK
instead as long as possible. It allows to reduce the TCP-option overhead
by 4 bytes, thus makes space for an additional SACK-block. It also makes
tcpdumps much easier to read when the DSN and DATA_ACK are both either
32 or 64-bit.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

netns: enable to inherit devconf from current netns

The goal is to be able to inherit the initial devconf parameters from the
current netns, ie the netns where this new netns has been created.

This is useful in a containers environment where /proc/sys is read only.
For example, if a pod is created with specifics devconf parameters and has
the capability to create netns, the user expects to get the same parameters
than his 'init_net', which is not the real init_net in this case.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: add bulking to XDP_TX

Add driver level bulking to the XDP_TX action.

An array of frame descriptors is held for each Tx frame queue and
populated accordingly when the action returned by the XDP program is
XDP_TX. The frames will be actually enqueued only when the array is
filled. At the end of the NAPI cycle a flush on the queued frames is
performed in order to enqueue the remaining FDs.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: fix checkpatch complains about tabs

This patch makes checkpatch happy for tabs

Signed-off-by: Kevin Lo <kevlo@kevlo.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2020-05-15' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-05-15

mlx5 core and mlx5e (netdev) updates:

1) Two fixes for release all FW pages support.
2) Improvement in calculating the send queue stop room on tx
3) Flow steering auto-groups creation improvements
4) TC offload fix for Connection tracking with NAT action
5) IPoIB support for self looback to allow communication between ipoib
pkey child interfaces on the same host.
6) DCBNL cleanup to avoid #ifdef DCBNL all over the main mlx5e code
7) Small and trivial code cleanup
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ethernet: ti: am65-cpts: Add missing inline qualifier to stub functions

When building with Clang:

In file included from drivers/net/ethernet/ti/am65-cpsw-ethtool.c:15:
drivers/net/ethernet/ti/am65-cpts.h:58:12: warning: unused function
'am65_cpts_ns_gettime' [-Wunused-function]
static s64 am65_cpts_ns_gettime(struct am65_cpts *cpts)
           ^
drivers/net/ethernet/ti/am65-cpts.h:63:12: warning: unused function
'am65_cpts_estf_enable' [-Wunused-function]
static int am65_cpts_estf_enable(struct am65_cpts *cpts,
           ^
drivers/net/ethernet/ti/am65-cpts.h:69:13: warning: unused function
'am65_cpts_estf_disable' [-Wunused-function]
static void am65_cpts_estf_disable(struct am65_cpts *cpts, int idx)
            ^
3 warnings generated.

These functions need to be marked as inline, which adds __maybe_unused,
to avoid these warnings, which is the pattern for stub functions.

Fixes: bd79fd79ba78 ("ethernet: ti: am65-cpts: add routines to support taprio offload")
Link: https://github.com/ClangBuiltLinux/linux/issues/1026
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net/mlx5e: Take DCBNL-related definitions into dedicated files

Take DCBNL-related definitions out of the common en.h header,
Use a dedicated header file for exposing them.
Some need not to be exposed, use them locally in the .c file.
Use stubs to eliminate use of CONFIG_MLX5_CORE_EN_DCB in the
generic control flows.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: Calculate SQ stop room in a robust way

Currently, different formulas are used to estimate the space that may be
taken by WQEs in the SQ during a single packet transmit. This space is
called stop room, and it's checked in the end of packet transmit to find
out if the next packet could overflow the SQ. If it could, the driver
tells the kernel to stop sending next packets.

Many factors affect the stop room:

1. Padding with NOPs to avoid WQEs spanning over page boundaries.

2. Enabled and disabled offloads (TLS, upcoming MPWQE).

3. The maximum size of a WQE.

The padding is performed before every WQE if it doesn't fit the current
page.

The current formula assumes that only one padding will be required per
packet, and it doesn't take into account that the WQEs posted during the
transmission of a single packet might exceed the page size in very rare
circumstances. For example, to hit this condition with 4096-byte pages,
TLS offload will have to interrupt an almost-full MPWQE session, be in
the resync flow and try to transmit a near to maximum amount of data.

To avoid SQ overflows in such rare cases after MPWQE is added, this
patch introduces a more robust formula to estimate the stop room. The
new formula uses the fact that a WQE of size X will not require more
than X-1 WQEBBs of padding. More exact estimations are possible, but
they result in much more complex and error-prone code for little gain.

Before this patch, the TLS stop room included space for both INNOVA and
ConnectX TLS offloads that couldn't run at the same time anyway, so this
patch accounts only for the active one.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: IPoIB, Drop multicast packets that this interface sent

After enabled loopback packets for IPoIB, we need to drop these packets
that this HCA has replicated and came back to the same interface that
sent them.

Fixes: 09650613aac5 ("net/mlx5e: IPoIB, Add PKEY child interface nic profile")
Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: IPoIB, Enable loopback packets for IPoIB interfaces

Enable loopback of unicast and multicast traffic for IPoIB enhanced
mode.
This will allow interfaces with the same pkey to communicate between
them e.g cloned interfaces that located in different namespaces.

Signed-off-by: Erez Shitrit <erezsh@mellanox.com>
Reviewed-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5e: CT: Fix offload with CT action after CT NAT action

It could be a chain of rules will do action CT again after CT NAT
Before this fix matching will break as we get into the CT table
after NAT changes and not CT NAT.
Fix this by adding pre ct and pre ct nat tables to skip ct/ct_nat
tables and go straight to post_ct table if ct/nat was already done.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Move internal timer read function to clock library

Move mlx5_read_internal_timer() into lib/clock.c file as it is being
used there. As such, make this function a static one.

In addition, rearrange headers include to support function move.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Aya Levin <ayal@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Wait for inactive autogroups

Currently, if one thread tries to add an entry to an autogrouped table
with no free matching group, while another thread is in the process of
creating a new matching autogroup, it doesn't wait for the new group
creation, and creates an unnecessary new autogroup.

Instead of skipping inactive, wait on the write lock of those groups.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Drain wq first during PCI device removal

mlx5_unload_one() is done with cleanup = true only once.

So instead of doing health wq drain inside the if(), directly do
during PCI device removal.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Have single error unwinding path

Having multiple error unwinding path are error prone.
Lets have just one error unwinding path.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Fix a bug of releasing wrong chunks on > 4K page size systems

On systems with page size larger than 4K, a fwp object has few 4K chunks.
Fix a bug in fwp free flow where the chunk address was dropped and
fwp->addr was used instead (first chunk address). This caused a wrong
update of fwp->bitmask which later can cause errors in re-alloc fwp
chunk flow.

In order to fix this it, re-factor the release flow:
- Free 4k: Releases a specific 4k chunk inside the fwp, defined by
starting address.
- Free fwp: Unconditionally release the whole fwp and its resources.
Free addr will call free fwp if all chunks were released, in order to do
code sharing.

In addition, fix npages to count for all released chunks correctly.

Fixes: 3079fddb43b2 ("net/mlx5: Add support for release all pages event")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

net/mlx5: Dedicate fw page to the requesting function

The cited patch assumes that all chuncks in a fw page belong to the same
function, thus the driver must dedicate fw page to the requesting
function, which is actually what was intedned in the original fw pages
allocator design, hence the fwp->func_id !

Up until the cited patch everything worked ok, but now "relase all pages"
is broken on systems with page_size > 4k.

Fix this by dedicating fw page to the requesting function id via adding a
func_id parameter to alloc_4k() function.

Fixes: 3079fddb43b2 ("net/mlx5: Add support for release all pages event")
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net

Pull networking fixes from David Miller:

1) Fix sk_psock reference count leak on receive, from Xiyu Yang.

2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.

3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
    from Maciej Żenczykowski.

4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
    Abeni.

5) Fix netprio cgroup v2 leak, from Zefan Li.

6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.

7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.

8) Various bpf probe handling fixes, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
  selftests: mptcp: pm: rm the right tmp file
  dpaa2-eth: properly handle buffer size restrictions
  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
  bpf: Restrict bpf_probe_read{, str}() only to archs where they work
  MAINTAINERS: Mark networking drivers as Maintained.
  ipmr: Add lockdep expression to ipmr_for_each_table macro
  ipmr: Fix RCU list debugging warning
  drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
  net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
  tcp: fix error recovery in tcp_zerocopy_receive()
  MAINTAINERS: Add Jakub to networking drivers.
  MAINTAINERS: another add of Karsten Graul for S390 networking
  drivers: ipa: fix typos for ipa_smp2p structure doc
  pppoe: only process PADT targeted at local interfaces
  selftests/bpf: Enforce returning 0 for fentry/fexit programs
  bpf: Enforce returning 0 for fentry/fexit progs
  net: stmmac: fix num_por initialization
  security: Fix the default value of secid_to_secctx hook
  libbpf: Fix register naming in PT_REGS s390 macros
  ...

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma fixes from Jason Gunthorpe:
"A few minor bug fixes for user visible defects, and one regression:

   - Various bugs from static checkers and syzkaller

   - Add missing error checking in mlx4

   - Prevent RTNL lock recursion in i40iw

   - Fix segfault in cxgb4 in peer abort cases

   - Fix a regression added in 5.7 where the IB_EVENT_DEVICE_FATAL could
     be lost, and wasn't delivered to all the FDs"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  RDMA/uverbs: Move IB_EVENT_DEVICE_FATAL to destroy_uobj
  RDMA/uverbs: Do not discard the IB_EVENT_DEVICE_FATAL event
  RDMA/iw_cxgb4: Fix incorrect function parameters
  RDMA/core: Fix double put of resource
  IB/core: Fix potential NULL pointer dereference in pkey cache
  IB/hfi1: Fix another case where pq is left on waitlist
  IB/i40iw: Remove bogus call to netdev_master_upper_dev_get()
  IB/mlx4: Test return value of calls to ib_get_cached_pkey
  RDMA/rxe: Always return ERR_PTR from rxe_create_mmap_info()
  i40iw: Fix error handling in i40iw_manage_arp_cache()

Merge tag 'linux-kselftest-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest

Pull kselftest fixes from Shuah Khan:

- lkdtm runner fixes to prevent dmesg clearing and shellcheck errors

- ftrace test handling when test module doesn't exist

- nsfs test fix to replace zero-length array with flexible-array

- dmabuf-heaps test fix to return clear error value

* tag 'linux-kselftest-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
  selftests/lkdtm: Use grep -E instead of egrep
  selftests/lkdtm: Don't clear dmesg when running tests
  selftests/ftrace: mark irqsoff_tracer.tc test as unresolved if the test module does not exist
  tools/testing: Replace zero-length array with flexible-array
  kselftests: dmabuf-heaps: Fix confused return value on expected error testing

Merge tag 'riscv-for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux

Pull RISC-V fixes from Palmer Dabbelt:
"A handful of build fixes, all found by Huawei's autobuilder.

  None of these patches should have any functional impact on kernels
  that build, and they're mostly related to various features
  intermingling with !MMU.

  While some of these might be better hoisted to generic code, it seems
  better to have the simple fixes in the meanwhile.

  As far as I know these are the only outstanding patches for 5.7"

* tag 'riscv-for-linus-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
  riscv: mmiowb: Fix implicit declaration of function 'smp_processor_id'
  riscv: pgtable: Fix __kernel_map_pages build error if NOMMU
  riscv: Make SYS_SUPPORTS_HUGETLBFS depends on MMU
  riscv: Disable ARCH_HAS_DEBUG_VIRTUAL if NOMMU
  riscv: Add pgprot_writecombine/device and PAGE_SHARED defination if NOMMU
  riscv: stacktrace: Fix undefined reference to `walk_stackframe'
  riscv: Fix unmet direct dependencies built based on SOC_VIRT
  riscv: perf: RISCV_BASE_PMU should be independent
  riscv: perf_event: Make some funciton static

Merge branch 'mptcp-fix-MP_JOIN-failure-handling'

Paolo Abeni says:

====================
mptcp: fix MP_JOIN failure handling

Currently if we hit an MP_JOIN failure on the third ack, the child socket is
closed with reset, but the request socket is not deleted, causing weird
behaviors.

The main problem is that MPTCP's MP_JOIN code needs to plug it's own
'valid 3rd ack' checks and the current TCP callbacks do not allow that.

This series tries to address the above shortcoming introducing a new MPTCP
specific bit in a 'struct tcp_request_sock' hole, and leveraging that to allow
tcp_check_req releasing the request socket when needed.

The above allows cleaning-up a bit current MPTCP hooking in tcp_check_req().

An alternative solution, possibly cleaner but more invasive, would be
changing the 'bool *own_req' syn_recv_sock() argument into 'int *req_status'
and let MPTCP set it to 'REQ_DROP'.

v1 -> v2:
- be more conservative about drop_req initialization

RFC -> v1:
- move the drop_req bit inside tcp_request_sock (Eric)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: cope better with MP_JOIN failure

Currently, on MP_JOIN failure we reset the child
socket, but leave the request socket untouched.

tcp_check_req will deal with it according to the
'tcp_abort_on_overflow' sysctl value - by default the
req socket will stay alive.

The above leads to inconsistent behavior on MP JOIN
failure, and bad listener overflow accounting.

This patch addresses the issue leveraging the infrastructure
just introduced to ask the TCP stack to drop the req on
failure.

The child socket is not freed anymore by subflow_syn_recv_sock(),
instead it's moved to a dead state and will be disposed by the
next sock_put done by the TCP stack, so that listener overflow
accounting is not affected by MP JOIN failure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet_connection_sock: factor out destroy helper.

Move the steps to prepare an inet_connection_sock for
forced disposal inside a separate helper. No functional
changes inteded, this will just simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: add new sock flag to deal with join subflows

MP_JOIN subflows must not land into the accept queue.
Currently tcp_check_req() calls an mptcp specific helper
to detect such scenario.

Such helper leverages the subflow context to check for
MP_JOIN subflows. We need to deal also with MP JOIN
failures, even when the subflow context is not available
due allocation failure.

A possible solution would be changing the syn_recv_sock()
signature to allow returning a more descriptive action/
error code and deal with that in tcp_check_req().

Since the above need is MPTCP specific, this patch instead
uses a TCP request socket hole to add a MPTCP specific flag.
Such flag is used by the MPTCP syn_recv_sock() to tell
tcp_check_req() how to deal with the request socket.

This change is a no-op for !MPTCP build, and makes the
MPTCP code simpler. It allows also the next patch to deal
correctly with MP JOIN failure.

v1 -> v2:
- be more conservative on drop_req initialization (Mat)

RFC -> v1:
- move the drop_req bit inside tcp_request_sock (Eric)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux

Pull arm64 fix from Catalin Marinas:
"Fix flush_icache_range() second argument in machine_kexec() to be an
address rather than size"

* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: fix the flush_icache_range arguments in machine_kexec

net: phy: tja11xx: execute cable test on link up

A typical 100Base-T1 link should be always connected. If the link is in
a shot or open state, it is a failure. In most cases, we won't be able
to automatically handle this issue, but we need to log it or notify user
(if possible).

With this patch, the cable will be tested on "ip l s dev .. up" attempt
and send ethnl notification to the user space.

This patch was tested with TJA1102 PHY and "ethtool --monitor" command.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Alexei Starovoitov says:

====================
pull-request: bpf 2020-05-15

The following pull-request contains BPF updates for your *net* tree.

We've added 9 non-merge commits during the last 2 day(s) which contain
a total of 14 files changed, 137 insertions(+), 43 deletions(-).

The main changes are:

1) Fix secid_to_secctx LSM hook default value, from Anders.

2) Fix bug in mmap of bpf array, from Andrii.

3) Restrict bpf_probe_read to archs where they work, from Daniel.

4) Enforce returning 0 for fentry/fexit progs, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: broadcom: add support for BCM54811 PHY

The BCM54811 PHY shares many similarities with the already supported BCM54810
PHY but additionally requires some semi-unique configuration.

Signed-off-by: Kevin Lo <kevlo@kevlo.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'cxgb4-improve-and-tune-TC-MQPRIO-offload'

Rahul Lakkireddy says:

====================
cxgb4: improve and tune TC-MQPRIO offload

Patch 1 improves the Tx path's credit request and recovery mechanism
when running under heavy load.

Patch 2 adds ability to tune the burst buffer sizes of all traffic
classes to improve performance for <= 1500 MTU, under heavy load.

Patch 3 adds support to track EOTIDs and dump software queue
contexts used by TC-MQPRIO offload.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: add EOTID tracking and software context dump

Rework and add support for dumping EOTID software context used by
TC-MQPRIO. Also track number of EOTIDs in use.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: tune burst buffer size for TC-MQPRIO offload

For each traffic class, firmware handles up to 4 * MTU amount of data
per burst cycle. Under heavy load, this small buffer size is a
bottleneck when buffering large TSO packets in <= 1500 MTU case.
Increase the burst buffer size to 8 * MTU when supported.

Also, keep the driver's traffic class configuration API similar to
the firmware API counterpart.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cxgb4: improve credits recovery in TC-MQPRIO Tx path

Request credit update for every half credits consumed, including
the current request. Also, avoid re-trying to post packets when there
are no credits left. The credit update reply via interrupt will
eventually restore the credits and will invoke the Tx path again.

Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 1 day(s) which contain
a total of 67 files changed, 741 insertions(+), 252 deletions(-).

The main changes are:

1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.

2) bpftool can probe CONFIG_HZ, from Daniel.

3) CAP_BPF is introduced to isolate user processes that use BPF infra and
to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
in certain cases, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: dsa: mt7530: fix VLAN setup

Allow DSA to add VLAN entries even if VLAN filtering is disabled, so
enabling it will not block the traffic of existent ports in the bridge

Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'Implement-classifier-action-terse-dump-mode'

Vlad Buslov says:

====================
Implement classifier-action terse dump mode

Output rate of current upstream kernel TC filter dump implementation if
relatively low (~100k rules/sec depending on configuration). This
constraint impacts performance of software switch implementation that
rely on TC for their datapath implementation and periodically call TC
filter dump to update rules stats. Moreover, TC filter dump output a lot
of static data that don't change during the filter lifecycle (filter
key, specific action details, etc.) which constitutes significant
portion of payload on resulting netlink packets and increases amount of
syscalls necessary to dump all filters on particular Qdisc. In order to
significantly improve filter dump rate this patch sets implement new
mode of TC filter dump operation named "terse dump" mode. In this mode
only parameters necessary to identify the filter (handle, action cookie,
etc.) and data that can change during filter lifecycle (filter flags,
action stats, etc.) are preserved in dump output while everything else
is omitted.

Userspace API is implemented using new TCA_DUMP_FLAGS tlv with only
available flag value TCA_DUMP_FLAGS_TERSE. Internally, new API requires
individual classifier support (new tcf_proto_ops->terse_dump()
callback). Support for action terse dump is implemented in act API and
don't require changing individual action implementations.

The following table provides performance comparison between regular
filter dump and new terse dump mode for two classifier-action profiles:
one minimal config with L2 flower classifier and single gact action and
another heavier config with L2+5tuple flower classifier with
tunnel_key+mirred actions.

Classifier-action type      |        dump |  terse dump | X improvement
                             | (rules/sec) | (rules/sec) |
-----------------------------+-------------+-------------+---------------
L2 with gact                |       141.8 |       293.2 |          2.07
L2+5tuple tunnel_key+mirred |        76.4 |       198.8 |          2.60

Benchmark details: to measure the rate tc filter dump and terse dump
commands are invoked on ingress Qdisc that have one million filters
configured using following commands.

> time sudo tc -s filter show dev ens1f0 ingress >/dev/null

> time sudo tc -s filter show terse dev ens1f0 ingress >/dev/null

Value in results table is calculated by dividing 1000000 total rules by
"real" time reported by time command.

Setup details: 2x Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz, 32GB memory
====================

Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: pm: rm the right tmp file

"$err" is a variable pointing to a temp file. "$out" is not: only used
as a local variable in "check()" and representing the output of a
command line.

Fixes: 7254e10b6c6c (selftests: add PM netlink functional tests)
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>

dpaa2-eth: properly handle buffer size restrictions

Depending on the WRIOP version, the buffer size on the RX path must by a
multiple of 64 or 256. Handle this restriction properly by aligning down
the buffer size to the necessary value. Also, use the new buffer size
dynamically computed instead of the compile time one.

Fixes: e106a04f1ffa ("dpaa2-eth: Use a single page per Rx buffer")
Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: implement flower classifier terse dump tests

Implement two basic tests to verify terse dump functionality of flower
classifier:

- Test that verifies that terse dump works.

- Test that verifies that terse dump doesn't print filter key.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: cls_flower: implement terse dump support

Implement tcf_proto_ops->terse_dump() callback for flower classifier. Only
dump handle, flags and action data in terse mode.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: implement terse dump support in act

Extend tcf_action_dump() with boolean argument 'terse' that is used to
request terse-mode action dump. In terse mode only essential data needed to
identify particular action (action kind, cookie, etc.) and its stats is put
to resulting skb and everything else is omitted. Implement
tcf_exts_terse_dump() helper in cls API that is intended to be used to
request terse dump of all exts (actions) attached to the filter.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: sched: introduce terse dump flag

Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
is intended to be used to improve performance of TC filter dump when
userland only needs to obtain stats and not the whole classifier/action
data. Extend struct tcf_proto_ops with new terse_dump() callback that must
be defined by supporting classifier implementations.

Support of the options in specific classifiers and actions is
implemented in following patches in the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: core: recursively find netdev by device node

The assumption that a device node is associated either with the
netdev's device, or the parent of that device, does not hold for all
drivers. E.g. Freescale's DPAA has two layers of platform devices
above the netdev. Instead, recursively walk up the tree from the
netdev, allowing any parent to match against the sought after node.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'hwmon-for-v5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging

Pull hwmon fixes from Guenter Roeck:

- Fix ADC access synchronization problem with da9052 driver

- Fix temperature limit and status reporting in nct7904 driver

- Fix drivetemp temperature reporting if SCT is supported but SCT data
   tables are not.

* tag 'hwmon-for-v5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/groeck/linux-staging:
  hwmon: (da9052) Synchronize access with mfd
  hwmon: (nct7904) Fix incorrect range of temperature limit registers
  hwmon: (nct7904) Read all SMI status registers in probe function
  hwmon: (drivetemp) Fix SCT support if SCT data tables are not supported

Merge tag 'sound-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
"Things look good and calming down; the only change to ALSA core is the
  fix for racy rawmidi buffer accesses spotted by syzkaller, and the
  rest are all small device-specific quirks for HD-audio and USB-audio
  devices"

* tag 'sound-5.7-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ALSA: hda/realtek - Limit int mic boost for Thinkpad T530
  ALSA: hda/realtek - Add COEF workaround for ASUS ZenBook UX431DA
  ALSA: hda/realtek: Enable headset mic of ASUS UX581LV with ALC295
  ALSA: hda/realtek - Enable headset mic of ASUS UX550GE with ALC295
  ALSA: hda/realtek - Enable headset mic of ASUS GL503VM with ALC295
  ALSA: hda/realtek: Add quirk for Samsung Notebook
  ALSA: rawmidi: Fix racy buffer resize under concurrent accesses
  ALSA: usb-audio: add mapping for ASRock TRX40 Creator
  ALSA: hda/realtek - Fix S3 pop noise on Dell Wyse
  Revert "ALSA: hda/realtek: Fix pop noise on ALC225"
  ALSA: firewire-lib: fix 'function sizeof not defined' error of tracepoints format
  ALSA: usb-audio: Add control message quirk delay for Kingston HyperX headset

Merge tag 'drm-fixes-2020-05-15' of git://anongit.freedesktop.org/drm/drm

Pull drm fixes from Dave Airlie:
"As mentioned last week an i915 PR came in late, but I left it, so the
  i915 bits of this cover 2 weeks, which is why it's likely a bit larger
  than usual.

  Otherwise it's mostly amdgpu fixes, one tegra fix, one meson fix.

  i915:
   - Handle idling during i915_gem_evict_something busy loops (Chris)
   - Mark current submissions with a weak-dependency (Chris)
   - Propagate error from completed fences (Chris)
   - Fixes on execlist to avoid GPU hang situation (Chris)
   - Fixes couple deadlocks (Chris)
   - Timeslice preemption fixes (Chris)
   - Fix Display Port interrupt handling on Tiger Lake (Imre)
   - Reduce debug noise around Frame Buffer Compression (Peter)
   - Fix logic around IPC W/a for Coffee Lake and Kaby Lake (Sultan)
   - Avoid dereferencing a dead context (Chris)

  tegra:
   - tegra120/4 smmu fixes

  amdgpu:
   - Clockgating fixes
   - Fix fbdev with scatter/gather display
   - S4 fix for navi
   - Soft recovery for gfx10
   - Freesync fixes
   - Atomic check cursor fix
   - Add a gfxoff quirk
   - MST fix

  amdkfd:
   - Fix GEM reference counting

  meson:
   - error code propogation fix"

* tag 'drm-fixes-2020-05-15' of git://anongit.freedesktop.org/drm/drm: (29 commits)
  drm/i915: Handle idling during i915_gem_evict_something busy loops
  drm/meson: pm resume add return errno branch
  drm/amd/amdgpu: Update update_config() logic
  drm/amd/amdgpu: add raven1 part to the gfxoff quirk list
  drm/i915: Mark concurrent submissions with a weak-dependency
  drm/i915: Propagate error from completed fences
  drm/i915/gvt: Fix kernel oops for 3-level ppgtt guest
  drm/i915/gvt: Init DPLL/DDI vreg for virtual display instead of inheritance.
  drm/amd/display: add basic atomic check for cursor plane
  drm/amd/display: Fix vblank and pageflip event handling for FreeSync
  drm/amdgpu: implement soft_recovery for gfx10
  drm/amdgpu: enable hibernate support on Navi1X
  drm/amdgpu: Use GEM obj reference for KFD BOs
  drm/amdgpu: force fbdev into vram
  drm/amd/powerplay: perform PG ungate prior to CG ungate
  drm/amdgpu: drop unnecessary cancel_delayed_work_sync on PG ungate
  drm/amdgpu: disable MGCG/MGLS also on gfx CG ungate
  drm/i915/execlists: Track inflight CCID
  drm/i915/execlists: Avoid reusing the same logical CCID
  drm/i915/gem: Remove object_is_locked assertion from unpin_from_display_plane
  ...

Merge branch 'bpf-cap'

Alexei Starovoitov says:

====================
v6->v7:
- permit SK_REUSEPORT program type under CAP_BPF as suggested by Marek Majkowski.
  It's equivalent to SOCKET_FILTER which is unpriv.

v5->v6:
- split allow_ptr_leaks into four flags.
- retain bpf_jit_limit under cap_sys_admin.
- fixed few other issues spotted by Daniel.

v4->v5:

Split BPF operations that are allowed under CAP_SYS_ADMIN into combination of
CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN and keep some of them under CAP_SYS_ADMIN.

The user process has to have
- CAP_BPF to create maps, do other sys_bpf() commands and load SK_REUSEPORT progs.
  Note: dev_map, sock_hash, sock_map map types still require CAP_NET_ADMIN.
  That could be relaxed in the future.
- CAP_BPF and CAP_PERFMON to load tracing programs.
- CAP_BPF and CAP_NET_ADMIN to load networking programs.
(or CAP_SYS_ADMIN for backward compatibility).

CAP_BPF solves three main goals:
1. provides isolation to user space processes that drop CAP_SYS_ADMIN and switch to CAP_BPF.
   More on this below. This is the major difference vs v4 set back from Sep 2019.
2. makes networking BPF progs more secure, since CAP_BPF + CAP_NET_ADMIN
   prevents pointer leaks and arbitrary kernel memory access.
3. enables fuzzers to exercise all of the verifier logic. Eventually finding bugs
   and making BPF infra more secure. Currently fuzzers run in unpriv.
   They will be able to run with CAP_BPF.

The patchset is long overdue follow-up from the last plumbers conference.
Comparing to what was discussed at LPC the CAP* checks at attach time are gone.
For tracing progs the CAP_SYS_ADMIN check was done at load time only. There was
no check at attach time. For networking and cgroup progs CAP_SYS_ADMIN was
required at load time and CAP_NET_ADMIN at attach time, but there are several
ways to bypass CAP_NET_ADMIN:
- if networking prog is using tail_call writing FD into prog_array will
  effectively attach it, but bpf_map_update_elem is an unprivileged operation.
- freplace prog with CAP_SYS_ADMIN can replace networking prog

Consolidating all CAP checks at load time makes security model similar to
open() syscall. Once the user got an FD it can do everything with it.
read/write/poll don't check permissions. The same way when bpf_prog_load
command returns an FD the user can do everything (including attaching,
detaching, and bpf_test_run).

The important design decision is to allow ID->FD transition for
CAP_SYS_ADMIN only. What it means that user processes can run
with CAP_BPF and CAP_NET_ADMIN and they will not be able to affect each
other unless they pass FDs via scm_rights or via pinning in bpffs.
ID->FD is a mechanism for human override and introspection.
An admin can do 'sudo bpftool prog ...'. It's possible to enforce via LSM that
only bpftool binary does bpf syscall with CAP_SYS_ADMIN and the rest of user
space processes do bpf syscall with CAP_BPF isolating bpf objects (progs, maps,
links) that are owned by such processes from each other.

Another significant change from LPC is that the verifier checks are split into
four flags. The allow_ptr_leaks flag allows pointer manipulations. The
bpf_capable flag enables all modern verifier features like bpf-to-bpf calls,
BTF, bounded loops, dead code elimination, etc. All the goodness. The
bypass_spec_v1 flag enables indirect stack access from bpf programs and
disables speculative analysis and bpf array mitigations. The bypass_spec_v4
flag disables store sanitation. That allows networking progs with CAP_BPF +
CAP_NET_ADMIN enjoy modern verifier features while being more secure.

Some networking progs may need CAP_BPF + CAP_NET_ADMIN + CAP_PERFMON,
since subtracting pointers (like skb->data_end - skb->data) is a pointer leak,
but the verifier may get smarter in the future.
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>

selftests/bpf: Use CAP_BPF and CAP_PERFMON in tests

Make all test_verifier test exercise CAP_BPF and CAP_PERFMON

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-4-alexei.starovoitov@gmail.com

bpf: Implement CAP_BPF

Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.

That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com

bpf, capability: Introduce CAP_BPF

Split BPF operations that are allowed under CAP_SYS_ADMIN into
combination of CAP_BPF, CAP_PERFMON, CAP_NET_ADMIN.
For backward compatibility include them in CAP_SYS_ADMIN as well.

The end result provides simple safety model for applications that use BPF:
- to load tracing program types
  BPF_PROG_TYPE_{KPROBE, TRACEPOINT, PERF_EVENT, RAW_TRACEPOINT, etc}
  use CAP_BPF and CAP_PERFMON
- to load networking program types
  BPF_PROG_TYPE_{SCHED_CLS, XDP, SK_SKB, etc}
  use CAP_BPF and CAP_NET_ADMIN

There are few exceptions from this rule:
- bpf_trace_printk() is allowed in networking programs, but it's using
  tracing mechanism, hence this helper needs additional CAP_PERFMON
  if networking program is using this helper.
- BPF_F_ZERO_SEED flag for hash/lru map is allowed under CAP_SYS_ADMIN only
  to discourage production use.
- BPF HW offload is allowed under CAP_SYS_ADMIN.
- bpf_probe_write_user() is allowed under CAP_SYS_ADMIN only.

CAPs are not checked at attach/detach time with two exceptions:
- loading BPF_PROG_TYPE_CGROUP_SKB is allowed for unprivileged users,
  hence CAP_NET_ADMIN is required at attach time.
- flow_dissector detach doesn't check prog FD at detach,
  hence CAP_NET_ADMIN is required at detach time.

CAP_SYS_ADMIN is required to iterate BPF objects (progs, maps, links) via get_next_id
command and convert them to file descriptor via GET_FD_BY_ID command.
This restriction guarantees that mutliple tasks with CAP_BPF are not able to
affect each other. That leads to clean isolation of tasks. For example:
task A with CAP_BPF and CAP_NET_ADMIN loads and attaches a firewall via bpf_link.
task B with the same capabilities cannot detach that firewall unless
task A explicitly passed link FD to task B via scm_rights or bpffs.
CAP_SYS_ADMIN can still detach/unload everything.

Two networking user apps with CAP_SYS_ADMIN and CAP_NET_ADMIN can
accidentely mess with each other programs and maps.
Two networking user apps with CAP_NET_ADMIN and CAP_BPF cannot affect each other.

CAP_NET_ADMIN + CAP_BPF allows networking programs access only packet data.
Such networking progs cannot access arbitrary kernel memory or leak pointers.

bpftool, bpftrace, bcc tools binaries should NOT be installed with
CAP_BPF and CAP_PERFMON, since unpriv users will be able to read kernel secrets.
But users with these two permissions will be able to use these tracing tools.

CAP_PERFMON is least secure, since it allows kprobes and kernel memory access.
CAP_NET_ADMIN can stop network traffic via iproute2.
CAP_BPF is the safest from security point of view and harmless on its own.

Having CAP_BPF and/or CAP_NET_ADMIN is not enough to write into arbitrary map
and if that map is used by firewall-like bpf prog.
CAP_BPF allows many bpf prog_load commands in parallel. The verifier
may consume large amount of memory and significantly slow down the system.

Existing unprivileged BPF operations are not affected.
In particular unprivileged users are allowed to load socket_filter and cg_skb
program types and to create array, hash, prog_array, map-in-map map types.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-2-alexei.starovoitov@gmail.com

bpf, bpftool: Allow probing for CONFIG_HZ from kernel config

In Cilium we've recently switched to make use of bpf_jiffies64() for
parts of our tc and XDP datapath since bpf_ktime_get_ns() is more
expensive and high-precision is not needed for our timeouts we have
anyway. Our agent has a probe manager which picks up the json of
bpftool's feature probe and we also use the macro output in our C
programs e.g. to have workarounds when helpers are not available on
older kernels.

Extend the kernel config info dump to also include the kernel's
CONFIG_HZ, and rework the probe_kernel_image_config() for allowing a
macro dump such that CONFIG_HZ can be propagated to BPF C code as a
simple define if available via config. Latter allows to have _compile-
time_ resolution of jiffies <-> sec conversion in our code since all
are propagated as known constants.

Given we cannot generally assume availability of kconfig everywhere,
we also have a kernel hz probe [0] as a fallback. Potentially, bpftool
could have an integrated probe fallback as well, although to derive it,
we might need to place it under 'bpftool feature probe full' or similar
given it would slow down the probing process overall. Yet 'full' doesn't
fit either for us since we don't want to pollute the kernel log with
warning messages from bpf_probe_write_user() and bpf_trace_printk() on
agent startup; I've left it out for the time being.

[0] https://github.com/cilium/cilium/blob/master/bpf/cilium-probe-kernel-hz.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Quentin Monnet <quentin@isovalent.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200513075849.20868-1-daniel@iogearbox.net

Merge branch 'restrict-bpf_probe_read'

Daniel Borkmann says:

====================
Small set of fixes in order to restrict BPF helpers for tracing which are
broken on archs with overlapping address ranges as per discussion in [0].
I've targetted this for -bpf tree so they can be routed as fixes. Thanks!

v1 -> v2:
  - switch to reusable %pks, %pus format specifiers (Yonghong)
    - fixate %s on kernel_ds probing for archs with overlapping addr space

      [0] https://lore.kernel.org/bpf/CAHk-=wjJKo0GVixYLmqPn-Q22WFu0xHaBSjKEo7e7Yw72y5SPQ@mail.gmail.com/T/
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier

Usage of plain %s conversion specifier in bpf_trace_printk() suffers from the
very same issue as bpf_probe_read{,str}() helpers, that is, it is broken on
archs with overlapping address ranges.

While the helpers have been addressed through work in 6cff04cfa39c ("bpf: Add
probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers"), we need
an option for bpf_trace_printk() as well to fix it.

Similarly as with the helpers, force users to make an explicit choice by adding
%pks and %pus specifier to bpf_trace_printk() which will then pick the corresponding
strncpy_from_unsafe*() variant to perform the access under KERNEL_DS or USER_DS.
The %pk* (kernel specifier) and %pu* (user specifier) can later also be extended
for other objects aside strings that are probed and printed under tracing, and
reused out of other facilities like bpf_seq_printf() or BTF based type printing.

Existing behavior of %s for current users is still kept working for archs where it
is not broken and therefore gated through CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
For archs not having this property we fall-back to pick probing under KERNEL_DS as
a sensible default.

Fixes: fced32942571 ("bpf: add support for %s specifier to bpf_trace_printk()")
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-4-daniel@iogearbox.net

bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range

Given bpf_probe_read{,str}() BPF helpers are now only available under
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE, we need to add the drop-in
replacements of bpf_probe_read_{kernel,user}_str() to do_refine_retval_range()
as well to avoid hitting the same issue as in d055298bb70d9 ("bpf/verifier:
refine retval R0 state for bpf_get_stack helper").

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200515101118.6508-3-daniel@iogearbox.net

bpf: Restrict bpf_probe_read{, str}() only to archs where they work

Given the legacy bpf_probe_read{,str}() BPF helpers are broken on archs
with overlapping address ranges, we should really take the next step to
disable them from BPF use there.

To generally fix the situation, we've recently added new helper variants
bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str().
For details on them, see 6cff04cfa39c ("bpf: Add probe_read_{user, kernel}
and probe_read_{user,kernel}_str helpers").

Given bpf_probe_read{,str}() have been around for ~5 years by now, there
are plenty of users at least on x86 still relying on them today, so we
cannot remove them entirely w/o breaking the BPF tracing ecosystem.

However, their use should be restricted to archs with non-overlapping
address ranges where they are working in their current form. Therefore,
move this behind a CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE and
have x86, arm64, arm select it (other archs supporting it can follow-up
on it as well).

For the remaining archs, they can workaround easily by relying on the
feature probe from bpftool which spills out defines that can be used out
of BPF C code to implement the drop-in replacement for old/new kernels
via: bpftool feature probe macro

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Brendan Gregg <brendan.d.gregg@gmail.com>
Cc: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/bpf/20200515101118.6508-2-daniel@iogearbox.net

Merge tag 'drm-misc-fixes-2020-05-14' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes

Just one meson patch this time to propagate an error code

Signed-off-by: Dave Airlie <airlied@redhat.com>
From: Maxime Ripard <maxime@cerno.tech>
Link: https://patchwork.freedesktop.org/patch/msgid/20200514073538.wvdtv5s2mt4wdrdj@gilmour.lan

Merge branch 'xdp-grow-tail'

Jesper Dangaard Brouer says:

====================
V4:
- Fixup checkpatch.pl issues
- Collected more ACKs

V3:
- Fix issue on virtio_net patch spotted by Jason Wang
- Adjust name for variable in mlx5 patch
- Collected more ACKs

V2:
- Fix bug in mlx5 for XDP_PASS case
- Collected nitpicks and ACKs from mailing list

V1:
- Fix bug in dpaa2

XDP have evolved to support several frame sizes, but xdp_buff was not
updated with this information. This have caused the side-effect that
XDP frame data hard end is unknown. This have limited the BPF-helper
bpf_xdp_adjust_tail to only shrink the packet. This patchset address
this and add packet tail extend/grow.

The purpose of the patchset is ALSO to reserve a memory area that can be
used for storing extra information, specifically for extending XDP with
multi-buffer support. One proposal is to use same layout as
skb_shared_info, which is why this area is currently 320 bytes.

When converting xdp_frame to SKB (veth and cpumap), the full tailroom
area can now be used and SKB truesize is now correct. For most
drivers this result in a much larger tailroom in SKB "head" data
area. The network stack can now take advantage of this when doing SKB
coalescing. Thus, a good driver test is to use xdp_redirect_cpu from
samples/bpf/ and do some TCP stream testing.

Use-cases for tail grow/extend:
(1) IPsec / XFRM needs a tail extend[1][2].
(2) DNS-cache responses in XDP.
(3) HAProxy ALOHA would need it to convert to XDP.
(4) Add tail info e.g. timestamp and collect via tcpdump

[1] http://vger.kernel.org/netconf2019_files/xfrm_xdp.pdf
[2] http://vger.kernel.org/netconf2019.html

Examples on howto access the tail area of an XDP packet is shown in the
XDP-tutorial example[3].

[3] https://github.com/xdp-project/xdp-tutorial/blob/master/experiment01-tailgrow/
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Xdp_adjust_tail add grow tail tests

Extend BPF selftest xdp_adjust_tail with grow tail tests, which is added
as subtest's. The first grow test stays in same form as original shrink
test. The second grow test use the newer bpf_prog_test_run_xattr() calls,
and does extra checking of data contents.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945350567.97035.9632611946765811876.stgit@firesoul

selftests/bpf: Adjust BPF selftest for xdp_adjust_tail

Current selftest for BPF-helper xdp_adjust_tail only shrink tail.
Make it more clear that this is a shrink test case.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945350058.97035.17280775016196207372.stgit@firesoul

bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().

Update the memory requirements, when adding xdp.frame_sz in BPF test_run
function bpf_prog_test_run_xdp() which e.g. is used by XDP selftests.

Specifically add the expected reserved tailroom, but also allocated a
larger memory area to reflect that XDP frames usually comes in this
format. Limit the provided packet data size to 4096 minus headroom +
tailroom, as this also reflect a common 3520 bytes MTU limit with XDP.

Note that bpf_test_init already use a memory allocation method that clears
memory. Thus, this already guards against leaking uninit kernel memory.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945349549.97035.15316291762482444006.stgit@firesoul

xdp: Clear grow memory in bpf_xdp_adjust_tail()

Clearing memory of tail when grow happens, because it is too easy
to write a XDP_PASS program that extend the tail, which expose
this memory to users that can run tcpdump.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945349039.97035.5262100484553494.stgit@firesoul

xdp: Allow bpf_xdp_adjust_tail() to grow packet size

Finally, after all drivers have a frame size, allow BPF-helper
bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.

Remember that helper/macro xdp_data_hard_end have reserved some
tailroom. Thus, this helper makes sure that the BPF-prog don't have
access to this tailroom area.

V2: Remove one chicken check and use WARN_ONCE for other

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul

mlx5: Rx queue setup time determine frame_sz for XDP

The mlx5 driver have multiple memory models, which are also changed
according to whether a XDP bpf_prog is attached.

The 'rx_striding_rq' setting is adjusted via ethtool priv-flags e.g.:
# ethtool --set-priv-flags mlx5p2 rx_striding_rq off

On the general case with 4K page_size and regular MTU packet, then
the frame_sz is 2048 and 4096 when XDP is enabled, in both modes.

The info on the given frame size is stored differently depending on the
RQ-mode and encoded in a union in struct mlx5e_rq union wqe/mpwqe.
In rx striding mode rq->mpwqe.log_stride_sz is either 11 or 12, which
corresponds to 2048 or 4096 (MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ).
In non-striding mode (MLX5_WQ_TYPE_CYCLIC) the frag_stride is stored
in rq->wqe.info.arr[0].frag_stride, for the first fragment, which is
what the XDP case cares about.

To reduce effect on fast-path, this patch determine the frame_sz at
setup time, to avoid determining the memory model runtime. Variable
is named frame0_sz to make it clear that this is only the frame
size of the first fragment.

This mlx5 driver does a DMA-sync on XDP_TX action, but grow is safe
as it have done a DMA-map on the entire PAGE_SIZE. The driver also
already does a XDP length check against sq->hw_mtu on the possible
XDP xmit paths mlx5e_xmit_xdp_frame() + mlx5e_xmit_xdp_frame_mpwqe().

V3+4: Change variable name first_frame_sz to frame0_sz

V2: Fix that frag_size need to be recalc before creating SKB.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Link: https://lore.kernel.org/bpf/158945348021.97035.12295039384250022883.stgit@firesoul

xdp: For Intel AF_XDP drivers add XDP frame_sz

Intel drivers implement native AF_XDP zerocopy in separate C-files,
that have its own invocation of bpf_prog_run_xdp(). The setup of
xdp_buff is also handled in separately from normal code path.

This patch update XDP frame_sz for AF_XDP zerocopy drivers i40e, ice
and ixgbe, as the code changes needed are very similar. Introduce a
helper function xsk_umem_xdp_frame_sz() for calculating frame size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/158945347511.97035.8536753731329475655.stgit@firesoul

ice: Add XDP frame size to driver

This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".

For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.

The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).

There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945347002.97035.328088795813704587.stgit@firesoul

i40e: Add XDP frame size to driver

This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".

For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.

The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).

There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945346494.97035.12809400414566061815.stgit@firesoul

ixgbevf: Add XDP frame size to VF driver

This patch mirrors the changes to ixgbe in previous patch.

This VF driver doesn't support XDP_REDIRECT, but correct tailroom is
still necessary for BPF-helper xdp_adjust_tail. In legacy-mode +
larger PAGE_SIZE, due to lacking tailroom, we accept that
xdp_adjust_tail shrink doesn't work.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945345984.97035.13518286183248025173.stgit@firesoul

ixgbe: Add XDP frame size to driver

This driver uses different memory models depending on PAGE_SIZE at
compile time. For PAGE_SIZE 4K it uses page splitting, meaning for
normal MTU frame size is 2048 bytes (and headroom 192 bytes). For
larger MTUs the driver still use page splitting, by allocating
order-1 pages (8192 bytes) for RX frames. For PAGE_SIZE larger than
4K, driver instead advance its rx_buffer->page_offset with the frame
size "truesize".

For XDP frame size calculations, this mean that in PAGE_SIZE larger
than 4K mode the frame_sz change on a per packet basis. For the page
split 4K PAGE_SIZE mode, xdp.frame_sz is more constant and can be
updated once outside the main NAPI loop.

The default setting in the driver uses build_skb(), which provides
the necessary headroom and tailroom for XDP-redirect in RX-frame
(in both modes).

There is one complication, which is legacy-rx mode (configurable via
ethtool priv-flags). There are zero headroom in this mode, which is a
requirement for XDP-redirect to work. The conversion to xdp_frame
(convert_to_xdp_frame) will detect this insufficient space, and
xdp_do_redirect() call will fail. This is deemed acceptable, as it
allows other XDP actions to still work in legacy-mode. In
legacy-mode + larger PAGE_SIZE due to lacking tailroom, we also
accept that xdp_adjust_tail shrink doesn't work.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Cc: Alexander Duyck <alexander.duyck@gmail.com>
Link: https://lore.kernel.org/bpf/158945345455.97035.14334355929030628741.stgit@firesoul

ixgbe: Fix XDP redirect on archs with PAGE_SIZE above 4K

The ixgbe driver have another memory model when compiled on archs with
PAGE_SIZE above 4096 bytes. In this mode it doesn't split the page in
two halves, but instead increment rx_buffer->page_offset by truesize of
packet (which include headroom and tailroom for skb_shared_info).

This is done correctly in ixgbe_build_skb(), but in ixgbe_rx_buffer_flip
which is currently only called on XDP_TX and XDP_REDIRECT, it forgets
to add the tailroom for skb_shared_info. This breaks XDP_REDIRECT, for
veth and cpumap. Fix by adding size of skb_shared_info tailroom.

Maintainers notice: This fix have been queued to Jeff.

Fixes: e59db4a1dd47 ("ixgbe: add initial support for xdp redirect")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Link: https://lore.kernel.org/bpf/158945344946.97035.17031588499266605743.stgit@firesoul

virtio_net: Add XDP frame size in two code paths

The virtio_net driver is running inside the guest-OS. There are two
XDP receive code-paths in virtio_net, namely receive_small() and
receive_mergeable(). The receive_big() function does not support XDP.

In receive_small() the frame size is available in buflen. The buffer
backing these frames are allocated in add_recvbuf_small() with same
size, except for the headroom, but tailroom have reserved room for
skb_shared_info. The headroom is encoded in ctx pointer as a value.

In receive_mergeable() the frame size is more dynamic. There are two
basic cases: (1) buffer size is based on a exponentially weighted
moving average (see DECLARE_EWMA) of packet length. Or (2) in case
virtnet_get_headroom() have any headroom then buffer size is
PAGE_SIZE. The ctx pointer is this time used for encoding two values;
the buffer len "truesize" and headroom. In case (1) if the rx buffer
size is underestimated, the packet will have been split over more
buffers (num_buf info in virtio_net_hdr_mrg_rxbuf placed in top of
buffer area). If that happens the XDP path does a xdp_linearize_page
operation.

V3: Adjust frame_sz in receive_mergeable() case, spotted by Jason Wang.

The code is really hard to follow, so some hints to reviewers.
The receive_mergeable() case gets frames that were allocated in
add_recvbuf_mergeable() which uses headroom=virtnet_get_headroom(),
and 'buf' ptr is advanced this headroom.  The headroom can only
be 0 or VIRTIO_XDP_HEADROOM, as virtnet_get_headroom is really
simple:

  static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
  {
return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
  }

As frame_sz is an offset size from xdp.data_hard_start, reviewers
should notice how this is calculated in receive_mergeable():

  int offset = buf - page_address(page);
  [...]
  data = page_address(xdp_page) + offset;
  xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;

The calculated offset will always be VIRTIO_XDP_HEADROOM when
reaching this code.  Thus, xdp.data_hard_start will be page-start
address plus vi->hdr_len.  Given this xdp.frame_sz need to be
reduced with vi->hdr_len size.

IMHO a followup patch should cleanup this code to make it easier
to maintain and understand, but it is outside the scope of this
patchset.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945344436.97035.9445115070189151680.stgit@firesoul

vhost_net: Also populate XDP frame size

In vhost_net_build_xdp() the 'buf' that gets queued via an xdp_buff
have embedded a struct tun_xdp_hdr (located at xdp->data_hard_start)
which contains the buffer length 'buflen' (with tailroom for
skb_shared_info). Also storing this buflen in xdp->frame_sz, does not
obsolete struct tun_xdp_hdr, as it also contains a struct
virtio_net_hdr with other information.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945343928.97035.4620233649151726289.stgit@firesoul

tun: Add XDP frame size

The tun driver have two code paths for running XDP (bpf_prog_run_xdp).
In both cases 'buflen' contains enough tailroom for skb_shared_info.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Link: https://lore.kernel.org/bpf/158945343419.97035.9594485183958037621.stgit@firesoul