]> git.baikalelectronics.ru Git - kernel.git/log
kernel.git
4 years agomptcp: allow collapsing consecutive sendpages on the same substream
Paolo Abeni [Wed, 22 Jan 2020 00:56:27 +0000 (16:56 -0800)]
mptcp: allow collapsing consecutive sendpages on the same substream

If the current sendmsg() lands on the same subflow we used last, we
can try to collapse the data.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: recvmsg() can drain data from multiple subflows
Paolo Abeni [Wed, 22 Jan 2020 00:56:26 +0000 (16:56 -0800)]
mptcp: recvmsg() can drain data from multiple subflows

With the previous patch in place, the msk can detect which subflow
has the current map with a simple walk, let's update the main
loop to always select the 'current' subflow. The exit conditions now
closely mirror tcp_recvmsg() to get expected timeout and signal
behavior.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: add subflow write space signalling and mptcp_poll
Florian Westphal [Wed, 22 Jan 2020 00:56:25 +0000 (16:56 -0800)]
mptcp: add subflow write space signalling and mptcp_poll

Add new SEND_SPACE flag to indicate that a subflow has enough space to
accept more data for transmission.

It gets cleared at the end of mptcp_sendmsg() in case ssk has run
below the free watermark.

It is (re-set) from the wspace callback.

This allows us to use msk->flags to determine the poll mask.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Implement MPTCP receive path
Mat Martineau [Wed, 22 Jan 2020 00:56:24 +0000 (16:56 -0800)]
mptcp: Implement MPTCP receive path

Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.

The subflow implements its own data_ready() ops, which ensures that the
pending data is in sequence - according to MPTCP seq number - dropping
out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
This allows the MPTCP socket layer to determine if more data is
available without having to consult the individual subflows.

It additionally validates the current mapping and propagates EoF events
to the connection socket.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Write MPTCP DSS headers to outgoing data packets
Mat Martineau [Wed, 22 Jan 2020 00:56:23 +0000 (16:56 -0800)]
mptcp: Write MPTCP DSS headers to outgoing data packets

Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, which is copied to page fragments and mapped in to MPTCP
DSS segments with size determined by the available page fragments and
the maximum mapping length allowed by the MPTCP specification. If
do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok -
the later skbs can either have duplicated DSS mapping information or
none at all, and the receiver can handle that.

The current implementation uses the subflow frag cache and tcp
sendpages to avoid excessive code duplication. More work is required to
ensure that it works correctly under memory pressure and to support
MPTCP-level retransmissions.

The MPTCP DSS checksum is not yet implemented.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Add setsockopt()/getsockopt() socket operations
Peter Krystad [Wed, 22 Jan 2020 00:56:22 +0000 (16:56 -0800)]
mptcp: Add setsockopt()/getsockopt() socket operations

set/getsockopt behaviour with multiple subflows is undefined.
Therefore, for now, we return -EOPNOTSUPP unless we're in fallback mode.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Add shutdown() socket operation
Peter Krystad [Wed, 22 Jan 2020 00:56:21 +0000 (16:56 -0800)]
mptcp: Add shutdown() socket operation

Call shutdown on all subflows in use on the given socket, or on the
fallback socket.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Add key generation and token tree
Peter Krystad [Wed, 22 Jan 2020 00:56:20 +0000 (16:56 -0800)]
mptcp: Add key generation and token tree

Generate the local keys, IDSN, and token when creating a new socket.
Introduce the token tree to track all tokens in use using a radix tree
with the MPTCP token itself as the index.

Override the rebuild_header callback in inet_connection_sock_af_ops for
creating the local key on a new outgoing connection.

Override the init_req callback of tcp_request_sock_ops for creating the
local key on a new incoming connection.

Will be used to obtain the MPTCP parent socket to handle incoming joins.

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Create SUBFLOW socket for incoming connections
Peter Krystad [Wed, 22 Jan 2020 00:56:19 +0000 (16:56 -0800)]
mptcp: Create SUBFLOW socket for incoming connections

Add subflow_request_sock type that extends tcp_request_sock
and add an is_mptcp flag to tcp_request_sock distinguish them.

Override the listen() and accept() methods of the MPTCP
socket proto_ops so they may act on the subflow socket.

Override the conn_request() and syn_recv_sock() handlers
in the inet_connection_sock to handle incoming MPTCP
SYNs and the ACK to the response SYN.

Add handling in tcp_output.c to add MP_CAPABLE to an outgoing
SYN-ACK response for a subflow_request_sock.

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Handle MP_CAPABLE options for outgoing connections
Peter Krystad [Wed, 22 Jan 2020 00:56:18 +0000 (16:56 -0800)]
mptcp: Handle MP_CAPABLE options for outgoing connections

Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Associate MPTCP context with TCP socket
Peter Krystad [Wed, 22 Jan 2020 00:56:17 +0000 (16:56 -0800)]
mptcp: Associate MPTCP context with TCP socket

Use ULP to associate a subflow_context structure with each TCP subflow
socket. Creating these sockets requires new bind and connect functions
to make sure ULP is set up immediately when the subflow sockets are
created.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Handle MPTCP TCP options
Peter Krystad [Wed, 22 Jan 2020 00:56:16 +0000 (16:56 -0800)]
mptcp: Handle MPTCP TCP options

Add hooks to parse and format the MP_CAPABLE option.

This option is handled according to MPTCP version 0 (RFC6824).
MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
coordination with related code changes.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agomptcp: Add MPTCP socket stubs
Mat Martineau [Wed, 22 Jan 2020 00:56:15 +0000 (16:56 -0800)]
mptcp: Add MPTCP socket stubs

Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-bridge-add-per-vlan-state-option'
David S. Miller [Fri, 24 Jan 2020 11:58:14 +0000 (12:58 +0100)]
Merge branch 'net-bridge-add-per-vlan-state-option'

Nikolay Aleksandrov says:

====================
net: bridge: add per-vlan state option

This set adds the first per-vlan option - state, which uses the new vlan
infrastructure that was recently added. It gives us forwarding control on
per-vlan basis. The first 3 patches prepare the vlan code to support option
dumping and modification. We still compress vlan ranges which have equal
options, each new option will have to add its own equality check to
br_vlan_opts_eq(). The vlans are created in forwarding state by default to
be backwards compatible and vlan state is considered only when the port
state is forwarding (more info in patch 4).
I'll send the selftest for the vlan state with the iproute2 patch-set.

v2: patch 3: do full (all-vlan) notification only on vlan
    create/delete, otherwise use the per-vlan notifications only,
    rework how option change ranges are detected, add more verbose error
    messages when setting options and add checks if a vlan should be used.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan: add per-vlan state
Nikolay Aleksandrov [Fri, 24 Jan 2020 11:40:22 +0000 (13:40 +0200)]
net: bridge: vlan: add per-vlan state

The first per-vlan option added is state, it is needed for EVPN and for
per-vlan STP. The state allows to control the forwarding on per-vlan
basis. The vlan state is considered only if the port state is forwarding
in order to avoid conflicts and be consistent. br_allowed_egress is
called only when the state is forwarding, but the ingress case is a bit
more complicated due to the fact that we may have the transition between
port:BR_STATE_FORWARDING -> vlan:BR_STATE_LEARNING which should still
allow the bridge to learn from the packet after vlan filtering and it will
be dropped after that. Also to optimize the pvid state check we keep a
copy in the vlan group to avoid one lookup. The state members are
modified with *_ONCE() to annotate the lockless access.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan: add basic option setting support
Nikolay Aleksandrov [Fri, 24 Jan 2020 11:40:21 +0000 (13:40 +0200)]
net: bridge: vlan: add basic option setting support

This patch adds support for option modification of single vlans and
ranges. It allows to only modify options, i.e. skip create/delete by
using the BRIDGE_VLAN_INFO_ONLY_OPTS flag. When working with a range
option changes we try to pack the notifications as much as possible.

v2: do full port (all vlans) notification only when creating/deleting
    vlans for compatibility, rework the range detection when changing
    options, add more verbose extack errors and check if a vlan should
    be used (br_vlan_should_use checks)

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: vlan: add basic option dumping support
Nikolay Aleksandrov [Fri, 24 Jan 2020 11:40:20 +0000 (13:40 +0200)]
net: bridge: vlan: add basic option dumping support

We'll be dumping the options for the whole range if they're equal. The
first range vlan will be used to extract the options. The commit doesn't
change anything yet it just adds the skeleton for the support. The dump
will happen when the first option is added.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: bridge: check port state before br_allowed_egress
Nikolay Aleksandrov [Fri, 24 Jan 2020 11:40:19 +0000 (13:40 +0200)]
net: bridge: check port state before br_allowed_egress

If we make sure that br_allowed_egress is called only when we have
BR_STATE_FORWARDING state then we can avoid a test later when we add
per-vlan state.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge tag 'mlx5-updates-2020-01-22' of git://git.kernel.org/pub/scm/linux/kernel...
David S. Miller [Thu, 23 Jan 2020 20:27:23 +0000 (21:27 +0100)]
Merge tag 'mlx5-updates-2020-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-01-22

This series provides updates to mlx5 driver.
1) Misc small cleanups
2) Some SW steering updates including header copy support
3) Full ethtool statistics support for E-Switch uplink representor
Some refactoring was required to share the bare-metal NIC ethtool
stats with the Uplink representor. On Top of this Vlad converts the
ethtool stats support in E-Swtich vports representors to use the mlx5e
"stats groups" infrastructure and then applied all applicable stats
to the uplink representor netdev.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'Add-PHY-IDs-for-DP83825-6'
David S. Miller [Thu, 23 Jan 2020 20:21:12 +0000 (21:21 +0100)]
Merge branch 'Add-PHY-IDs-for-DP83825-6'

Dan Murphy says:

====================
Add PHY IDs for DP83825/6

Adding new PHY IDs for the DP83825 and DP83826 TI Ethernet PHYs to the DP83822
PHY driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: DP83822: Add support for additional DP83825 devices
Dan Murphy [Wed, 22 Jan 2020 15:34:55 +0000 (09:34 -0600)]
net: phy: DP83822: Add support for additional DP83825 devices

Add PHY IDs for the DP83825CS, DP83825CM and the DP83825S devices to the
DP83822 driver.

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agophy: dp83826: Add phy IDs for DP83826N and 826NC
Dan Murphy [Wed, 22 Jan 2020 15:34:54 +0000 (09:34 -0600)]
phy: dp83826: Add phy IDs for DP83826N and 826NC

Add phy IDs to the DP83822 phy driver for the DP83826N
and the DP83826NC devices.  The register map and features
are the same for basic enablement.

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'net-sched-add-Flow-Queue-PIE-packet-scheduler'
David S. Miller [Thu, 23 Jan 2020 10:38:31 +0000 (11:38 +0100)]
Merge branch 'net-sched-add-Flow-Queue-PIE-packet-scheduler'

Gautam Ramakrishnan says:

====================
net: sched: add Flow Queue PIE packet scheduler

Flow Queue PIE packet scheduler

This patch series implements the Flow Queue Proportional
Integral controller Enhanced (FQ-PIE) active queue
Management algorithm. It is an enhancement over the PIE
algorithm. It integrates the PIE aqm with a deficit round robin
scheme.

FQ-PIE is implemented over the latest version of PIE which
uses timestamps to calculate queue delay with an additional
option of using average dequeue rate to calculate the queue
delay. This patch also adds a memory limit of all the packets
across all queues to a default value of 32Mb.

 - Patch #1
   - Creates pie.h and moves all small functions and structures
     common to PIE and FQ-PIE here. The functions are all made
     inline.
 - Patch #2 - #8
   - Addresses code formatting, indentation, comment changes
     and rearrangement of structure members.
 - Patch #9
   - Refactors sch_pie.c by changing arguments to
     calculate_probability(), [pie_]drop_early() and
     pie_process_dequeue() to make it generic enough to
     be used by sch_fq_pie.c. These functions are exported
     to be used by sch_fq_pie.c.
 - Patch #10
   - Adds the FQ-PIE Qdisc.

For more information:
https://tools.ietf.org/html/rfc8033

Changes from v6 to v7
 - Call tcf_block_put() when destroying the Qdisc as suggested
   by Jakub Kicinski.

Changes from v5 to v6
 - Rearranged struct members according to their access pattern
   and to remove holes.

Changes from v4 to v5
 - This patch series breaks down patch 1 of v4 into
   separate logical commits as suggested by David Miller.

Changes from v3 to v4
 - Used non deprecated version of nla_parse_nested
 - Used SZ_32M macro
 - Removed an unused variable
 - Code cleanup
 All suggested by Jakub and Toke.

Changes from v2 to v3
 - Exported drop_early, pie_process_dequeue and
   calculate_probability functions from sch_pie as
   suggested by Stephen Hemminger.

Changes from v1 ( and RFC patch) to v2
 - Added timestamp to calculate queue delay as recommended
   by Dave Taht
 - Packet memory limit implemented as recommended by Toke.
 - Added external classifier as recommended by Toke.
 - Used NET_XMIT_CN instead of NET_XMIT_DROP as the return
   value in the fq_pie_qdisc_enqueue function.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: add Flow Queue PIE packet scheduler
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:33 +0000 (23:52 +0530)]
net: sched: add Flow Queue PIE packet scheduler

Principles:
  - Packets are classified on flows.
  - This is a Stochastic model (as we use a hash, several flows might
                                be hashed to the same slot)
  - Each flow has a PIE managed queue.
  - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
  - For a given flow, packets are not reordered.
  - Drops during enqueue only.
  - ECN capability is off by default.
  - ECN threshold (if ECN is enabled) is at 10% by default.
  - Uses timestamps to calculate queue delay by default.

Usage:
tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
                    [ target TIME ] [ tupdate TIME ]
                    [ alpha NUMBER ] [ beta NUMBER ]
                    [ quantum BYTES ] [ memory_limit BYTES ]
                    [ ecnprob PERCENTAGE ] [ [no]ecn ]
                    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

defaults:
  limit: 10240 packets, flows: 1024
  target: 15 ms, tupdate: 15 ms (in jiffies)
  alpha: 1/8, beta : 5/4
  quantum: device MTU, memory_limit: 32 Mb
  ecnprob: 10%, ecn: off
  bytemode: off, dq_rate_estimator: off

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com>
Signed-off-by: V. Saicharan <vsaicharan1998@gmail.com>
Signed-off-by: Mohit Bhasi <mohitbhasi1998@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: pie: export symbols to be reused by FQ-PIE
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:32 +0000 (23:52 +0530)]
net: sched: pie: export symbols to be reused by FQ-PIE

This patch makes the drop_early(), calculate_probability() and
pie_process_dequeue() functions generic enough to be used by
both PIE and FQ-PIE (to be added in a future commit). The major
change here is in the way the functions take in arguments. This
patch exports these functions and makes FQ-PIE dependent on
sch_pie.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: pie: fix alignment in struct instances
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:31 +0000 (23:52 +0530)]
net: sched: pie: fix alignment in struct instances

Make the alignment in the initialization of the struct instances
consistent in the file.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: pie: fix commenting
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:30 +0000 (23:52 +0530)]
net: sched: pie: fix commenting

Fix punctuation and logical mistakes in the comments. The
logical mistake was that "dequeue_rate" is no longer the default
way to calculate queuing delay and is not needed. The default
way to calculate queue delay was changed in commit 941dd9a6807e
("net: sched: pie: enable timestamp based delay calculation").

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agopie: improve comments and commenting style
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:29 +0000 (23:52 +0530)]
pie: improve comments and commenting style

Improve the comments along with the commenting style used to
describe the members of the structures and their initial values
in the init functions.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agopie: rearrange structure members and their initializations
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:28 +0000 (23:52 +0530)]
pie: rearrange structure members and their initializations

Rearrange the members of the structure such that closely
referenced members appear together and/or fit in the same
cacheline. Also, change the order of their initializations to
match the order in which they appear in the structure.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agopie: use u8 instead of bool in pie_vars
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:27 +0000 (23:52 +0530)]
pie: use u8 instead of bool in pie_vars

Linux best practice recommends using u8 for true/false values in
structures.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agopie: rearrange macros in order of length
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:26 +0000 (23:52 +0530)]
pie: rearrange macros in order of length

Rearrange macros in order of length and align the values to
improve readability.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agopie: use U64_MAX to denote (2^64 - 1)
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:25 +0000 (23:52 +0530)]
pie: use U64_MAX to denote (2^64 - 1)

Use the U64_MAX macro to denote the constant (2^64 - 1).

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: sched: pie: move common code to pie.h
Mohit P. Tahiliani [Wed, 22 Jan 2020 18:22:24 +0000 (23:52 +0530)]
net: sched: pie: move common code to pie.h

This patch moves macros, structures and small functions common
to PIE and FQ-PIE (to be added in a future commit) from the file
net/sched/sch_pie.c to the header file include/net/pie.h.
All the moved functions are made inline.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: convert suitable drivers to use phy_do_ioctl_running
Heiner Kallweit [Tue, 21 Jan 2020 21:09:33 +0000 (22:09 +0100)]
net: convert suitable drivers to use phy_do_ioctl_running

Convert suitable drivers to use new helper phy_do_ioctl_running.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Acked-by: Timur Tabi <timur@kernel.org>
Acked-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
David S. Miller [Thu, 23 Jan 2020 07:10:16 +0000 (08:10 +0100)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-01-22

The following pull-request contains BPF updates for your *net-next* tree.

We've added 92 non-merge commits during the last 16 day(s) which contain
a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

The main changes are:

1) function by function verification and program extensions from Alexei.

2) massive cleanup of selftests/bpf from Toke and Andrii.

3) batched bpf map operations from Brian and Yonghong.

4) tcp congestion control in bpf from Martin.

5) bulking for non-map xdp_redirect form Toke.

6) bpf_send_signal_thread helper from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/mlx5e: Enable all available stats for uplink reps
Vlad Buslov [Tue, 21 Jan 2020 18:08:42 +0000 (20:08 +0200)]
net/mlx5e: Enable all available stats for uplink reps

Extend stats group array of uplink representor with all stats that are
available for PF in legacy mode, besides ipsec and TLS which are not
supported.

Don't output vport stats for uplink representor because they are already
handled by 802_3 group (with different names: {tx|rx}_{bytes|packets}_phy).

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Create q counters on uplink representors
Vlad Buslov [Tue, 21 Jan 2020 19:36:40 +0000 (21:36 +0200)]
net/mlx5e: Create q counters on uplink representors

Q counters were disabled for all types of representors to prevent an issue
where there is not enough resources to init q counters for 127 representor
instances. Enable q counters only for uplink representors to support
"rx_out_of_buffer", "rx_if_down_packets" counters in ethtool.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Convert rep stats to mlx5e_stats_grp-based infra
Vlad Buslov [Tue, 21 Jan 2020 17:38:21 +0000 (19:38 +0200)]
net/mlx5e: Convert rep stats to mlx5e_stats_grp-based infra

In order to support all of the supported stats that are available in legacy
mode for switchdev uplink representors, convert rep stats infrastructure to
reuse struct mlx5e_stats_grp that is already used when device is in legacy
mode. Refactor rep code to use array of mlx5e_stats_grp
structures (constructed using macros provided by stats infra) to
fill/update stats, instead of fixed hardcoded set of values. This approach
allows to easily extend representors with new stats types.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: IPoIB, use separate stats groups
Saeed Mahameed [Tue, 21 Jan 2020 09:15:28 +0000 (01:15 -0800)]
net/mlx5e: IPoIB, use separate stats groups

Don't copy all of the stats groups used for mlx5e ethernet NIC profile,
have a separate stats groups for IPoIB with the set of the needed stats
only.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Convert stats groups array to array of group pointers
Saeed Mahameed [Tue, 21 Jan 2020 08:54:00 +0000 (00:54 -0800)]
net/mlx5e: Convert stats groups array to array of group pointers

Convert stats groups array to array of "stats group" pointers to allow
sharing and individual selection of groups per profile as illustrated in
the next patches.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
4 years agonet/mlx5e: Declare stats groups via macro
Saeed Mahameed [Tue, 21 Jan 2020 08:24:53 +0000 (00:24 -0800)]
net/mlx5e: Declare stats groups via macro

Introduce new macros to declare stats callbacks and groups, for better
code reuse and for individual groups selection per profile which will be
introduced in next patches.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
4 years agonet/mlx5e: Profile specific stats groups
Saeed Mahameed [Tue, 21 Jan 2020 06:32:12 +0000 (22:32 -0800)]
net/mlx5e: Profile specific stats groups

Attach stats groups array to the profiles and make the stats utility
functions (get_num, update, fill, fill_strings) generic and use the
profile->stats_grps rather the hardcoded NIC stats groups.

This will allow future extension to have per profile stats groups.

In this patch mlx5e NIC and IPoIB will still share the same stats
groups.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
4 years agonet/mlx5e: Move uplink rep init/cleanup code into own functions
Roi Dayan [Wed, 23 Oct 2019 06:59:36 +0000 (09:59 +0300)]
net/mlx5e: Move uplink rep init/cleanup code into own functions

Clean up the code and allows to call uplink rep init/cleanup
from different location later.
To be used later for a new uplink representor mode.

Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: DR, Allow connecting flow table to a lower/same level table
Yevgeny Kliteynik [Mon, 20 Jan 2020 09:51:36 +0000 (11:51 +0200)]
net/mlx5: DR, Allow connecting flow table to a lower/same level table

Allow connecting SW steering source table to a lower/same level
destination table.
Lifting this limitation is required to support Connection Tracking.

Signed-off-by: Yevgeny Kliteynik <kliteyn@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: DR, Modify header copy support
Hamdan Igbaria [Thu, 9 Jan 2020 11:27:16 +0000 (13:27 +0200)]
net/mlx5: DR, Modify header copy support

Modify header supports ADD/SET and from this patch
also COPY. Copy allows to copy header fields and
metadata.

Signed-off-by: Hamdan Igbaria <hamdani@mellanox.com>
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: DR, Modify set action limitation extension
Hamdan Igbaria [Tue, 24 Dec 2019 16:07:41 +0000 (18:07 +0200)]
net/mlx5: DR, Modify set action limitation extension

Modify set actions are not supported on both tx
and rx, added a check for that.
Also refactored the code in a way that every modify
action has his own functions, this needed so in the
future we could add copy action more smoothly.

Signed-off-by: Hamdan Igbaria <hamdani@mellanox.com>
Signed-off-by: Alex Vesker <valex@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Add mlx5e_flower_parse_meta support
wenxu [Tue, 7 Jan 2020 09:16:06 +0000 (17:16 +0800)]
net/mlx5e: Add mlx5e_flower_parse_meta support

In the flowtables offload all the devices in the flowtables
share the same flow_block. An offload rule will be installed on
all the devices. This scenario is not correct.

It is no problem if there are only two devices in the flowtable,
The rule with ingress and egress on the same device can be reject
by driver.

But more than two devices in the flowtable will install the wrong
rules on hardware.

For example:
Three devices in a offload flowtables: dev_a, dev_b, dev_c

A rule ingress from dev_a and egress to dev_b:
The rule will install on device dev_a.
The rule will try to install on dev_b but failed for ingress
and egress on the same device.
The rule will install on dev_c. This is not correct.

The flowtables offload avoid this case through restricting the ingress dev
with FLOW_DISSECTOR_KEY_META.

So the mlx5e driver also should support the FLOW_DISSECTOR_KEY_META parse.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5: make the symbol 'ESW_POOLS' static
Chen Wandun [Mon, 20 Jan 2020 12:41:53 +0000 (20:41 +0800)]
net/mlx5: make the symbol 'ESW_POOLS' static

Fix the following sparse warning:
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads_chains.c:35:20: warning: symbol 'ESW_POOLS' was not declared. Should it be static?

Fixes: 5c29d0c3d1d3 ("net/mlx5: E-Switch, Refactor chains and priorities")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: allow TSO on VXLAN over VLAN topologies
Davide Caratti [Thu, 9 Jan 2020 11:07:59 +0000 (12:07 +0100)]
net/mlx5e: allow TSO on VXLAN over VLAN topologies

since mlx5 hardware can segment correctly TSO packets on VXLAN over VLAN
topologies, CPU usage can improve significantly if we enable tunnel
offloads in dev->vlan_features, like it was done in the past with other
NIC drivers (e.g. mlx4, be2net and ixgbe).

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agonet/mlx5e: Fix printk format warning
Olof Johansson [Fri, 20 Dec 2019 00:15:17 +0000 (16:15 -0800)]
net/mlx5e: Fix printk format warning

Use "%zu" for size_t. Seen on ARM allmodconfig:

drivers/net/ethernet/mellanox/mlx5/core/wq.c: In function 'mlx5_wq_cyc_wqe_dump':
include/linux/kern_levels.h:5:18: warning: format '%ld' expects argument of type 'long int', but argument 5 has type 'size_t' {aka 'unsigned int'} [-Wformat=]

Fixes: 4e23b5774fd3 ("net/mlx5e: TX, Dump WQs wqe descriptors on CQE with error events")
Signed-off-by: Olof Johansson <olof@lixom.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
4 years agoMerge branch 'bpf_cubic'
Alexei Starovoitov [Thu, 23 Jan 2020 00:30:11 +0000 (16:30 -0800)]
Merge branch 'bpf_cubic'

Martin KaFai Lau says:

====================
This set adds bpf_cubic.c example.  It was separated from the
earlier BPF STRUCT_OPS series.  Some highlights since the
last post:

1. It is based on EricD recent fixes to the kernel tcp_cubic. [1]
2. The bpf jiffies reading helper is inlined by the verifier.
   Different from the earlier version, it only reads jiffies alone
   and does not do usecs/jiffies conversion.
3. The bpf .kconfig map is used to read CONFIG_HZ.

[1]: https://patchwork.ozlabs.org/cover/1215066/

v3:
- Remove __weak from CONFIG_HZ in patch 3. (Andrii)

v2:
- Move inlining to fixup_bpf_calls() in patch 1. (Daniel)
- It is inlined for 64 BITS_PER_LONG and jit_requested
  as the map_gen_lookup().  Other cases could be
  considered together with map_gen_lookup() if needed.
- Use usec resolution in bictcp_update() calculation in patch 3.
  usecs_to_jiffies() is then removed().  (Eric)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agobpf: tcp: Add bpf_cubic example
Martin KaFai Lau [Wed, 22 Jan 2020 23:36:58 +0000 (15:36 -0800)]
bpf: tcp: Add bpf_cubic example

This patch adds a bpf_cubic example.  Some highlights:
1. CONFIG_HZ .kconfig map is used.
2. In bictcp_update(), calculation is changed to use usec
   resolution (i.e. USEC_PER_JIFFY) instead of using jiffies.
   Thus, usecs_to_jiffies() is not used in the bpf_cubic.c.
3. In bitctcp_update() [under tcp_friendliness], the original
   "while (ca->ack_cnt > delta)" loop is changed to the equivalent
   "ca->ack_cnt / delta" operation.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200122233658.903774-1-kafai@fb.com
4 years agobpf: Sync uapi bpf.h to tools/
Martin KaFai Lau [Wed, 22 Jan 2020 23:36:52 +0000 (15:36 -0800)]
bpf: Sync uapi bpf.h to tools/

This patch sync uapi bpf.h to tools/.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200122233652.903348-1-kafai@fb.com
4 years agobpf: Add BPF_FUNC_jiffies64
Martin KaFai Lau [Wed, 22 Jan 2020 23:36:46 +0000 (15:36 -0800)]
bpf: Add BPF_FUNC_jiffies64

This patch adds a helper to read the 64bit jiffies.  It will be used
in a later patch to implement the bpf_cubic.c.

The helper is inlined for jit_requested and 64 BITS_PER_LONG
as the map_gen_lookup().  Other cases could be considered together
with map_gen_lookup() if needed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
4 years agoMerge branch 'bpf-dynamic-relinking'
Daniel Borkmann [Wed, 22 Jan 2020 22:04:53 +0000 (23:04 +0100)]
Merge branch 'bpf-dynamic-relinking'

Alexei Starovoitov says:

====================
The last few month BPF community has been discussing an approach to call
chaining, since exiting bpt_tail_call() mechanism used in production XDP
programs has plenty of downsides. The outcome of these discussion was a
conclusion to implement dynamic re-linking of BPF programs. Where rootlet XDP
program attached to a netdevice can programmatically define a policy of
execution of other XDP programs. Such rootlet would be compiled as normal XDP
program and provide a number of placeholder global functions which later can be
replaced with future XDP programs. BPF trampoline, function by function
verification were building blocks towards that goal. The patch 1 is a final
building block. It introduces dynamic program extensions. A number of
improvements like more flexible function by function verification and better
libbpf api will be implemented in future patches.

v1->v2:
- addressed Andrii's comments
- rebase
====================

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
4 years agoselftests/bpf: Add tests for program extensions
Alexei Starovoitov [Tue, 21 Jan 2020 00:53:48 +0000 (16:53 -0800)]
selftests/bpf: Add tests for program extensions

Add program extension tests that build on top of fexit_bpf2bpf tests.
Replace three global functions in previously loaded test_pkt_access.c program
with three new implementations:
int get_skb_len(struct __sk_buff *skb);
int get_constant(long val);
int get_skb_ifindex(int val, struct __sk_buff *skb, int var);
New function return the same results as original only if arguments match.

new_get_skb_ifindex() demonstrates that 'skb' argument doesn't have to be first
and only argument of BPF program. All normal skb based accesses are available.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200121005348.2769920-4-ast@kernel.org
4 years agolibbpf: Add support for program extensions
Alexei Starovoitov [Tue, 21 Jan 2020 00:53:47 +0000 (16:53 -0800)]
libbpf: Add support for program extensions

Add minimal support for program extensions. bpf_object_open_opts() needs to be
called with attach_prog_fd = target_prog_fd and BPF program extension needs to
have in .c file section definition like SEC("freplace/func_to_be_replaced").
libbpf will search for "func_to_be_replaced" in the target_prog_fd's BTF and
will pass it in attach_btf_id to the kernel. This approach works for tests, but
more compex use case may need to request function name (and attach_btf_id that
kernel sees) to be more dynamic. Such API will be added in future patches.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200121005348.2769920-3-ast@kernel.org
4 years agobpf: Introduce dynamic program extensions
Alexei Starovoitov [Tue, 21 Jan 2020 00:53:46 +0000 (16:53 -0800)]
bpf: Introduce dynamic program extensions

Introduce dynamic program extensions. The users can load additional BPF
functions and replace global functions in previously loaded BPF programs while
these programs are executing.

Global functions are verified individually by the verifier based on their types only.
Hence the global function in the new program which types match older function can
safely replace that corresponding function.

This new function/program is called 'an extension' of old program. At load time
the verifier uses (attach_prog_fd, attach_btf_id) pair to identify the function
to be replaced. The BPF program type is derived from the target program into
extension program. Technically bpf_verifier_ops is copied from target program.
The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops.
The extension program can call the same bpf helper functions as target program.
Single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
types. The verifier allows only one level of replacement. Meaning that the
extension program cannot recursively extend an extension. That also means that
the maximum stack size is increasing from 512 to 1024 bytes and maximum
function nesting level from 8 to 16. The programs don't always consume that
much. The stack usage is determined by the number of on-stack variables used by
the program. The verifier could have enforced 512 limit for combined original
plus extension program, but it makes for difficult user experience. The main
use case for extensions is to provide generic mechanism to plug external
programs into policy program or function call chaining.

BPF trampoline is used to track both fentry/fexit and program extensions
because both are using the same nop slot at the beginning of every BPF
function. Attaching fentry/fexit to a function that was replaced is not
allowed. The opposite is true as well. Replacing a function that currently
being analyzed with fentry/fexit is not allowed. The executable page allocated
by BPF trampoline is not used by program extensions. This inefficiency will be
optimized in future patches.

Function by function verification of global function supports scalars and
pointer to context only. Hence program extensions are supported for such class
of global functions only. In the future the verifier will be extended with
support to pointers to structures, arrays with sizes, etc.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org
4 years agonet: convert additional drivers to use phy_do_ioctl
Heiner Kallweit [Tue, 21 Jan 2020 21:05:14 +0000 (22:05 +0100)]
net: convert additional drivers to use phy_do_ioctl

The first batch of driver conversions missed a few cases where we can
use phy_do_ioctl too.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agobpf, btf: Always output invariant hit in pahole DWARF to BTF transform
Chris Down [Wed, 22 Jan 2020 00:01:10 +0000 (00:01 +0000)]
bpf, btf: Always output invariant hit in pahole DWARF to BTF transform

When trying to compile with CONFIG_DEBUG_INFO_BTF enabled, I got this
error:

    % make -s
    Failed to generate BTF for vmlinux
    Try to disable CONFIG_DEBUG_INFO_BTF
    make[3]: *** [vmlinux] Error 1

Compiling again without -s shows the true error (that pahole is
missing), but since this is fatal, we should show the error
unconditionally on stderr as well, not silence it using the `info`
function. With this patch:

    % make -s
    BTF: .tmp_vmlinux.btf: pahole (pahole) is not available
    Failed to generate BTF for vmlinux
    Try to disable CONFIG_DEBUG_INFO_BTF
    make[3]: *** [vmlinux] Error 1

Signed-off-by: Chris Down <chris@chrisdown.name>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200122000110.GA310073@chrisdown.name
4 years agoselftests/bpf: Build urandom_read with LDFLAGS and LDLIBS
Daniel Díaz [Wed, 22 Jan 2020 16:44:24 +0000 (17:44 +0100)]
selftests/bpf: Build urandom_read with LDFLAGS and LDLIBS

During cross-compilation, it was discovered that LDFLAGS and
LDLIBS were not being used while building binaries, leading
to defaults which were not necessarily correct.

OpenEmbedded reported this kind of problem:

  ERROR: QA Issue: No GNU_HASH in the ELF binary [...], didn't pass LDFLAGS?

Signed-off-by: Daniel Díaz <daniel.diaz@linaro.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
4 years agobpf: Fix error path under memory pressure
Alexei Starovoitov [Wed, 22 Jan 2020 02:41:38 +0000 (18:41 -0800)]
bpf: Fix error path under memory pressure

Restore the 'if (env->cur_state)' check that was incorrectly removed during
code move. Under memory pressure env->cur_state can be freed and zeroed inside
do_check(). Hence the check is necessary.

Fixes: 24810a531086 ("bpf: Introduce function-by-function verification")
Reported-by: syzbot+b296579ba5015704d9fa@syzkaller.appspotmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200122024138.3385590-1-ast@kernel.org
4 years agobpf: Fix trampoline usage in preempt
Alexei Starovoitov [Tue, 21 Jan 2020 03:22:31 +0000 (19:22 -0800)]
bpf: Fix trampoline usage in preempt

Though the second half of trampoline page is unused a task could be
preempted in the middle of the first half of trampoline and two
updates to trampoline would change the code from underneath the
preempted task. Hence wait for tasks to voluntarily schedule or go
to userspace. Add similar wait before freeing the trampoline.

Fixes: 34fb660b89c4 ("bpf: Introduce BPF trampoline")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
Link: https://lore.kernel.org/bpf/20200121032231.3292185-1-ast@kernel.org
4 years agoxsk, net: Make sock_def_readable() have external linkage
Björn Töpel [Mon, 20 Jan 2020 09:29:17 +0000 (10:29 +0100)]
xsk, net: Make sock_def_readable() have external linkage

XDP sockets use the default implementation of struct sock's
sk_data_ready callback, which is sock_def_readable(). This function
is called in the XDP socket fast-path, and involves a retpoline. By
letting sock_def_readable() have external linkage, and being called
directly, the retpoline can be avoided.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120092917.13949-1-bjorn.topel@gmail.com
4 years agobpf: don't bother with getname/kern_path - use user_path_at
Al Viro [Mon, 20 Jan 2020 23:28:58 +0000 (23:28 +0000)]
bpf: don't bother with getname/kern_path - use user_path_at

kernel/bpf/inode.c misuses kern_path...() - it's much simpler (and
more efficient, on top of that) to use user_path...() counterparts
rather than bothering with doing getname() manually.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120232858.GF8904@ZenIV.linux.org.uk
4 years agoMerge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec...
David S. Miller [Tue, 21 Jan 2020 11:18:20 +0000 (12:18 +0100)]
Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next

Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-01-21

1) Add support for TCP encapsulation of IKE and ESP messages,
   as defined by RFC 8229. Patchset from Sabrina Dubroca.

Please note that there is a merge conflict in:

net/unix/af_unix.c

between commit:

87edf96f73c2 ("unix: Show number of pending scm files of receive queue in fdinfo")

from the net-next tree and commit:

3f4c19cf56b3 ("net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram")

from the ipsec-next tree.

The conflict can be solved as done in linux-next.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agodrivers: net: declance: fix comparing pointer to 0
Chen Zhou [Tue, 21 Jan 2020 09:24:55 +0000 (17:24 +0800)]
drivers: net: declance: fix comparing pointer to 0

Fixes coccicheck warning:

./drivers/net/ethernet/amd/declance.c:611:14-15:
WARNING comparing pointer to 0

Replace "skb == 0" with "!skb".

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agotcp/ipv4: remove AF_INET_FAMILY
Alex Shi [Tue, 21 Jan 2020 08:50:07 +0000 (16:50 +0800)]
tcp/ipv4: remove AF_INET_FAMILY

After commit aaf9f4668c2b ("tcp/dccp: install syn_recv requests into ehash table")
the macro isn't used anymore. remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/hsr: remove seq_nr_after_or_eq
Alex Shi [Tue, 21 Jan 2020 08:49:53 +0000 (16:49 +0800)]
net/hsr: remove seq_nr_after_or_eq

It's never used after introduced. So maybe better to remove.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Arvid Brodin <arvid.brodin@alten.se>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agohdlx_x25: Fix backwards compat test.
David S. Miller [Tue, 21 Jan 2020 11:02:25 +0000 (12:02 +0100)]
hdlx_x25: Fix backwards compat test.

drivers/net/wan/hdlc_x25.c: In function ‘x25_ioctl’:
drivers/net/wan/hdlc_x25.c:256:7: warning: suggest parentheses around assignment used as truth value [-Wparentheses]
  256 |   if (ifr->ifr_settings.size = 0) {
      |       ^~~

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'hns3-next'
David S. Miller [Tue, 21 Jan 2020 10:46:21 +0000 (11:46 +0100)]
Merge branch 'hns3-next'

Huazhong Tan says:

====================
net: hns3: misc updates for -net-next

This series includes some misc updates for the HNS3 ethernet driver.

[patch 1] adds a limitation for the error log in the
hns3_clean_tx_ring().
[patch 2] adds a check for pfmemalloc flag before reusing pages
since these pages may be used some special case.
[patch 3] assigns a default reset type 'HNAE3_NONE_RESET' to
VF's reset_type after initializing or reset.
[patch 4] unifies macro HCLGE_DFX_REG_TYPE_CNT's definition into
header file.
[patch 5] refines the parameter 'size' of snprintf() in the
hns3_init_module().
[patch 6] rewrites a debug message in hclge_put_vector().
[patch 7~9] adds some cleanups related to coding style.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: cleanup some coding style issue
Huazhong Tan [Tue, 21 Jan 2020 08:42:13 +0000 (16:42 +0800)]
net: hns3: cleanup some coding style issue

This patch removes some unnecessary return value assignments,
some duplicated printing in the caller, refines the judgment
of 0 and uses le16_to_cpu to replace __le16_to_cpu.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: remove redundant print on ENOMEM
Huazhong Tan [Tue, 21 Jan 2020 08:42:12 +0000 (16:42 +0800)]
net: hns3: remove redundant print on ENOMEM

All kmalloc-based functions print enough information on failures.
So this patch removes the log in hclge_get_dfx_reg() when returns
ENOMEM.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: delete unnecessary blank line and space for cleanup
Guangbin Huang [Tue, 21 Jan 2020 08:42:11 +0000 (16:42 +0800)]
net: hns3: delete unnecessary blank line and space for cleanup

This patch deletes some unnecessary blank lines and spaces to clean up
code, and in hclgevf_set_vlan_filter() moves the comment to the front
of hclgevf_send_mbx_msg().

Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: rewrite a log in hclge_put_vector()
Yonglong Liu [Tue, 21 Jan 2020 08:42:10 +0000 (16:42 +0800)]
net: hns3: rewrite a log in hclge_put_vector()

When gets vector fails, hclge_put_vector() should print out
the vector instead of vector_id in the log and return the wrong
vector_id to its caller.

Signed-off-by: Yonglong Liu <liuyonglong@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: refine the input parameter 'size' for snprintf()
Guojia Liao [Tue, 21 Jan 2020 08:42:09 +0000 (16:42 +0800)]
net: hns3: refine the input parameter 'size' for snprintf()

The function snprintf() writes at most size bytes (including the
terminating null byte ('\0') to str. Now, We can guarantee that the
parameter of size is lager than the length of str to be formatting
including its terminating null byte. So it's unnecessary to minus 1
for the input parameter 'size'.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: move duplicated macro definition into header
Guojia Liao [Tue, 21 Jan 2020 08:42:08 +0000 (16:42 +0800)]
net: hns3: move duplicated macro definition into header

Macro HCLGE_GET_DFX_REG_TYPE_CNT in hclge_dbg_get_dfx_bd_num()
and macro HCLGE_DFX_REG_BD_NUM in hclge_get_dfx_reg_bd_num()
have the same meaning, so just defines HCLGE_GET_DFX_REG_TYPE_CNT
in hclge_main.h.

Signed-off-by: Guojia Liao <liaoguojia@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: set VF's default reset_type to HNAE3_NONE_RESET
Huazhong Tan [Tue, 21 Jan 2020 08:42:07 +0000 (16:42 +0800)]
net: hns3: set VF's default reset_type to HNAE3_NONE_RESET

reset_type means what kind of reset the driver is handling now,
so after initializing or reset, the reset_type of VF should be
set to HNAE3_NONE_RESET, otherwise, this unknown default value
may be a little misleading when the device is running.

Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: do not reuse pfmemalloc pages
Yunsheng Lin [Tue, 21 Jan 2020 08:42:06 +0000 (16:42 +0800)]
net: hns3: do not reuse pfmemalloc pages

HNS3 driver allocates pages for DMA with dev_alloc_pages(), which
calls alloc_pages_node() with the __GFP_MEMALLOC flag. So, in case
of OOM condition, HNS3 can get pages with pfmemalloc flag set.

So do not reuse the pages with pfmemalloc flag set because those
pages are reserved for special cases, such as low memory case.

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: limit the error logging in the hns3_clean_tx_ring()
Yunsheng Lin [Tue, 21 Jan 2020 08:42:05 +0000 (16:42 +0800)]
net: hns3: limit the error logging in the hns3_clean_tx_ring()

The error log printed by netdev_err() in the hns3_clean_tx_ring()
may spam the kernel log.

This patch uses hns3_rl_err() to ratelimit the error log in the
hns3_clean_tx_ring().

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agowan/hdlc_x25: fix skb handling
Martin Schiller [Tue, 21 Jan 2020 06:00:34 +0000 (07:00 +0100)]
wan/hdlc_x25: fix skb handling

o call skb_reset_network_header() before hdlc->xmit()
 o change skb proto to HDLC (0x0019) before hdlc->xmit()
 o call dev_queue_xmit_nit() before hdlc->xmit()

This changes make it possible to trace (tcpdump) outgoing layer2
(ETH_P_HDLC) packets

Additionally call skb_reset_network_header() after each skb_push() /
skb_pull().

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agowan/hdlc_x25: make lapb params configurable
Martin Schiller [Tue, 21 Jan 2020 06:00:33 +0000 (07:00 +0100)]
wan/hdlc_x25: make lapb params configurable

This enables you to configure mode (DTE/DCE), Modulo, Window, T1, T2, N2 via
sethdlc (which needs to be patched as well).

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet/smc: allow unprivileged users to read pnet table
Hans Wippel [Tue, 21 Jan 2020 00:04:46 +0000 (01:04 +0100)]
net/smc: allow unprivileged users to read pnet table

The current flags of the SMC_PNET_GET command only allow privileged
users to retrieve entries from the pnet table via netlink. The content
of the pnet table may be useful for all users though, e.g., for
debugging smc connection problems.

This patch removes the GENL_ADMIN_PERM flag so that unprivileged users
can read the pnet table.

Signed-off-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'phy-add-new-version-of-phy_do_ioctl-and-convert-suitable-drivers'
David S. Miller [Tue, 21 Jan 2020 09:50:41 +0000 (10:50 +0100)]
Merge branch 'phy-add-new-version-of-phy_do_ioctl-and-convert-suitable-drivers'

Heiner Kallweit says:

====================
net: phy: add new version of phy_do_ioctl and convert suitable drivers

We just added phy_do_ioctl, but it turned out that we need another
version of this function that doesn't check whether net_device is
running. So rename phy_do_ioctl to phy_do_ioctl_running and add a
new version of phy_do_ioctl. Eventually convert suitable drivers
to use phy_do_ioctl.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: convert suitable network drivers to use phy_do_ioctl
Heiner Kallweit [Mon, 20 Jan 2020 21:18:37 +0000 (22:18 +0100)]
net: convert suitable network drivers to use phy_do_ioctl

Convert suitable network drivers to use phy_do_ioctl.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: add new version of phy_do_ioctl
Heiner Kallweit [Mon, 20 Jan 2020 21:17:11 +0000 (22:17 +0100)]
net: phy: add new version of phy_do_ioctl

Add a new version of phy_do_ioctl that doesn't check whether net_device
is running. It will typically be used if suitable drivers attach the
PHY in probe already.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: phy: rename phy_do_ioctl to phy_do_ioctl_running
Heiner Kallweit [Mon, 20 Jan 2020 21:16:07 +0000 (22:16 +0100)]
net: phy: rename phy_do_ioctl to phy_do_ioctl_running

We just added phy_do_ioctl, but it turned out that we need another
version of this function that doesn't check whether net_device is
running. So rename phy_do_ioctl to phy_do_ioctl_running.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: replace snprintf with scnprintf in hns3_update_strings
Chen Zhou [Mon, 20 Jan 2020 12:50:33 +0000 (20:50 +0800)]
net: hns3: replace snprintf with scnprintf in hns3_update_strings

snprintf returns the number of bytes that would be written, which may be
greater than the the actual length to be written. Here use extra code to
handle this.

scnprintf returns the number of bytes that was actually written, just use
scnprintf to simplify the code.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agonet: hns3: replace snprintf with scnprintf in hns3_dbg_cmd_read
Chen Zhou [Mon, 20 Jan 2020 12:49:43 +0000 (20:49 +0800)]
net: hns3: replace snprintf with scnprintf in hns3_dbg_cmd_read

The return value of snprintf may be greater than the size of
HNS3_DBG_READ_LEN, use scnprintf instead in hns3_dbg_cmd_read.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge tag 'rds-odp-for-5.5' of https://git.kernel.org/pub/scm/linux/kernel/git/leon...
David S. Miller [Tue, 21 Jan 2020 09:22:51 +0000 (10:22 +0100)]
Merge tag 'rds-odp-for-5.5' of https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma

Leon Romanovsky says:

====================
Use ODP MRs for kernel ULPs

The following series extends MR creation routines to allow creation of
user MRs through kernel ULPs as a proxy. The immediate use case is to
allow RDS to work over FS-DAX, which requires ODP (on-demand-paging)
MRs to be created and such MRs were not possible to create prior this
series.

The first part of this patchset extends RDMA to have special verb
ib_reg_user_mr(). The common use case that uses this function is a
userspace application that allocates memory for HCA access but the
responsibility to register the memory at the HCA is on an kernel ULP.
This ULP acts as an agent for the userspace application.

The second part provides advise MR functionality for ULPs. This is
integral part of ODP flows and used to trigger pagefaults in advance
to prepare memory before running working set.

The third part is actual user of those in-kernel APIs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
4 years agoMerge branch 'libbpf-include-path'
Alexei Starovoitov [Tue, 21 Jan 2020 00:37:46 +0000 (16:37 -0800)]
Merge branch 'libbpf-include-path'

Toke Høiland-Jørgensen says:

====================
We are currently being somewhat inconsistent with the libbpf include paths,
which makes it difficult to move files from the kernel into an external
libbpf-using project without adjusting include paths.

Having the bpf/ subdir of $INCLUDEDIR in the include path has never been a
requirement for building against libbpf before, and indeed the libbpf pkg-config
file doesn't include it. So let's make all libbpf includes across the kernel
tree use the bpf/ prefix in their includes. Since bpftool skeleton generation
emits code with a libbpf include, this also ensures that those can be used in
existing external projects using the regular pkg-config include path.

This turns out to be a somewhat invasive change in the number of files touched;
however, the actual changes to files are fairly trivial (most of them are simply
made with 'sed'). The series is split to make the change for one tool subdir at
a time, while trying not to break the build along the way. It is structured like
this:

- Patch 1-3: Trivial fixes to Makefiles for issues I discovered while changing
  the include paths.

- Patch 4-8: Change the include directives to use the bpf/ prefix, and updates
  Makefiles to make sure tools/lib/ is part of the include path, but without
  removing tools/lib/bpf

- Patch 9-11: Remove tools/lib/bpf from include paths to make sure we don't
  inadvertently re-introduce includes without the bpf/ prefix.

Changelog:

v5:
  - Combine the libbpf build rules in selftests Makefile (using Andrii's
    suggestion for a make rule).
  - Re-use self-tests libbpf build for runqslower (new patch 10)
  - Formatting fixes

v4:
  - Move runqslower error on missing BTF into make rule
  - Make sure we don't always force a rebuild selftests
  - Rebase on latest bpf-next (dropping patch 11)

v3:
  - Don't add the kernel build dir to the runqslower Makefile, pass it in from
    selftests instead.
  - Use libbpf's 'make install_headers' in selftests instead of trying to
    generate bpf_helper_defs.h in-place (to also work on read-only filesystems).
  - Use a scratch builddir for both libbpf and bpftool when building in selftests.
  - Revert bpf_helpers.h to quoted include instead of angled include with a bpf/
    prefix.
  - Fix a few style nits from Andrii

v2:
  - Do a full cleanup of libbpf includes instead of just changing the
    bpf_helper_defs.h include.
====================

Acked-by: Andrii Nakryiko <andriin@fb.com>
Tested-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
4 years agoselftests: Refactor build to remove tools/lib/bpf from include path
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:52 +0000 (14:06 +0100)]
selftests: Refactor build to remove tools/lib/bpf from include path

To make sure no new files are introduced that doesn't include the bpf/
prefix in its #include, remove tools/lib/bpf from the include path
entirely.

Instead, we introduce a new header files directory under the scratch tools/
dir, and add a rule to run the 'install_headers' rule from libbpf to have a
full set of consistent libbpf headers in $(OUTPUT)/tools/include/bpf, and
then use $(OUTPUT)/tools/include as the include path for selftests.

For consistency we also make sure we put all the scratch build files from
other bpftool and libbpf into tools/build/, so everything stays within
selftests/.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/157952561246.1683545.2762245552022369203.stgit@toke.dk
4 years agorunsqslower: Support user-specified libbpf include and object paths
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:51 +0000 (14:06 +0100)]
runsqslower: Support user-specified libbpf include and object paths

This adds support for specifying the libbpf include and object paths as
arguments to the runqslower Makefile, to support reusing the libbpf version
built as part of the selftests.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/157952561135.1683545.5660339645093141381.stgit@toke.dk
4 years agotools/runqslower: Remove tools/lib/bpf from include path
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:50 +0000 (14:06 +0100)]
tools/runqslower: Remove tools/lib/bpf from include path

Since we are now consistently using the bpf/ prefix on #include directives,
we don't need to include tools/lib/bpf in the include path. Remove it to
make sure we don't inadvertently introduce new includes without the prefix.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952561027.1683545.1976265477926794138.stgit@toke.dk
4 years agosamples/bpf: Use consistent include paths for libbpf
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:49 +0000 (14:06 +0100)]
samples/bpf: Use consistent include paths for libbpf

Fix all files in samples/bpf to include libbpf header files with the bpf/
prefix, to be consistent with external users of the library. Also ensure
that all includes of exported libbpf header files (those that are exported
on 'make install' of the library) use bracketed includes instead of quoted.

To make sure no new files are introduced that doesn't include the bpf/
prefix in its include, remove tools/lib/bpf from the include path entirely,
and use tools/lib instead.

Fixes: 72ed068845b7 ("selftests/bpf: Ensure bpf_helper_defs.h are taken from selftests dir")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952560911.1683545.8795966751309534150.stgit@toke.dk
4 years agoperf: Use consistent include paths for libbpf
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:48 +0000 (14:06 +0100)]
perf: Use consistent include paths for libbpf

Fix perf to include libbpf header files with the bpf/ prefix, to
be consistent with external users of the library.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952560797.1683545.7685921032671386301.stgit@toke.dk
4 years agobpftool: Use consistent include paths for libbpf
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:46 +0000 (14:06 +0100)]
bpftool: Use consistent include paths for libbpf

Fix bpftool to include libbpf header files with the bpf/ prefix, to be
consistent with external users of the library. Also ensure that all
includes of exported libbpf header files (those that are exported on 'make
install' of the library) use bracketed includes instead of quoted.

To make sure no new files are introduced that doesn't include the bpf/
prefix in its include, remove tools/lib/bpf from the include path entirely,
and use tools/lib instead.

Fixes: 72ed068845b7 ("selftests/bpf: Ensure bpf_helper_defs.h are taken from selftests dir")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952560684.1683545.4765181397974997027.stgit@toke.dk
4 years agoselftests: Use consistent include paths for libbpf
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:45 +0000 (14:06 +0100)]
selftests: Use consistent include paths for libbpf

Fix all selftests to include libbpf header files with the bpf/ prefix, to
be consistent with external users of the library. Also ensure that all
includes of exported libbpf header files (those that are exported on 'make
install' of the library) use bracketed includes instead of quoted.

To not break the build, keep the old include path until everything has been
changed to the new one; a subsequent patch will remove that.

Fixes: 72ed068845b7 ("selftests/bpf: Ensure bpf_helper_defs.h are taken from selftests dir")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952560568.1683545.9649335788846513446.stgit@toke.dk
4 years agotools/runqslower: Use consistent include paths for libbpf
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:44 +0000 (14:06 +0100)]
tools/runqslower: Use consistent include paths for libbpf

Fix the runqslower tool to include libbpf header files with the bpf/
prefix, to be consistent with external users of the library. Also ensure
that all includes of exported libbpf header files (those that are exported
on 'make install' of the library) use bracketed includes instead of quoted.

To not break the build, keep the old include path until everything has been
changed to the new one; a subsequent patch will remove that.

Fixes: 72ed068845b7 ("selftests/bpf: Ensure bpf_helper_defs.h are taken from selftests dir")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952560457.1683545.9913736511685743625.stgit@toke.dk
4 years agoselftests: Pass VMLINUX_BTF to runqslower Makefile
Toke Høiland-Jørgensen [Mon, 20 Jan 2020 13:06:43 +0000 (14:06 +0100)]
selftests: Pass VMLINUX_BTF to runqslower Makefile

Add a VMLINUX_BTF variable with the locally-built path when calling the
runqslower Makefile from selftests. This makes sure a simple 'make'
invocation in the selftests dir works even when there is no BTF information
for the running kernel. Do a wildcard expansion and include the same paths
for BTF for the running kernel as in the runqslower Makefile, to make it
possible to build selftests without having a vmlinux in the local tree.

Also fix the make invocation to use $(OUTPUT)/tools as the destination
directory instead of $(CURDIR)/tools.

Fixes: 3b9284d76d96 ("selftests/bpf: Build runqslower from selftests")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/157952560344.1683545.2723631988771664417.stgit@toke.dk