git.baikalelectronics.ru Git

netlink: fix policy dump leak

[ Upstream commit dfa312730691bdcc402eb91c74137405b5765d8a ]

If userspace doesn't complete the policy dump, we leak the
allocated state. Fix this.

Fixes: 6ec88086e2e4 ("netlink: add infrastructure to expose policies to userspace")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

lib8390: Use netif_msg_init to initialize msg_enable bits

Use netif_msg_init() to process param settings
and use only the proper initialized value of
ei_local->msg_level for later processing;

Signed-off-by: Armin Wolf <W_Armin@gmx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: phy: realtek: Modify 2.5G PHY name to RTL8226

Realtek single-chip Ethernet PHY solutions can be separated as below:
10M/100Mbps: RTL8201X
1Gbps: RTL8211X
2.5Gbps: RTL8226/RTL8221X
RTL8226 is the first version for realtek that compatible 2.5Gbps single PHY.
Since RTL8226 is single port only, realtek changes its name to RTL8221B from
the second version.
PHY ID for RTL8226 is 0x001cc800 and RTL8226B/RTL8221B is 0x001cc840.

RTL8125 is not a single PHY solution, it integrates PHY/MAC/PCIE bus
controller and embedded memory.

Signed-off-by: Willy Liu <willy.liu@realtek.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

caif_virtio: Remove redundant initialization of variable err

After commit 7da1a769c190 ("caif_virtio: Check that vringh_config is not
null"), the variable err is being initialized with '-EINVAL' that is
meaningless. So remove it.

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net-sysfs: Fix inconsistent of format with argument type in net-sysfs.c

Fix follow warnings:
[net/core/net-sysfs.c:1161]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'int'.
[net/core/net-sysfs.c:1162]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'int'.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

pktgen: Fix inconsistent of format with argument type in pktgen.c

Fix follow warnings:
[net/core/pktgen.c:925]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:942]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:962]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:984]: (warning) %u in format string (no. 1)
requires 'unsigned int' but the argument type is 'signed int'.
[net/core/pktgen.c:1149]: (warning) %d in format string (no. 1)
requires 'int' but the argument type is 'unsigned int'.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drivers/net/wan/hdlc_fr: Correctly handle special skb->protocol values

The fr_hard_header function is used to prepend the header to skbs before
transmission. It is used in 3 situations:
1) When a control packet is generated internally in this driver;
2) When a user sends an skb on an Ethernet-emulating PVC device;
3) When a user sends an skb on a normal PVC device.

These 3 situations need to be handled differently by fr_hard_header.
Different headers should be prepended to the skb in different situations.

Currently fr_hard_header distinguishes these 3 situations using
skb->protocol. For situation 1 and 2, a special skb->protocol value
will be assigned before calling fr_hard_header, so that it can recognize
these 2 situations. All skb->protocol values other than these special ones
are treated by fr_hard_header as situation 3.

However, it is possible that in situation 3, the user sends an skb with
one of the special skb->protocol values. In this case, fr_hard_header
would incorrectly treat it as situation 1 or 2.

This patch tries to solve this issue by using skb->dev instead of
skb->protocol to distinguish between these 3 situations. For situation
1, skb->dev would be NULL; for situation 2, skb->dev->type would be
ARPHRD_ETHER; and for situation 3, skb->dev->type would be ARPHRD_DLCI.

This way fr_hard_header would be able to distinguish these 3 situations
correctly regardless what skb->protocol value the user tries to use in
situation 3.

Cc: Krzysztof Halasa <khc@pm.waw.pl>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next

Daniel Borkmann says:

====================
pull-request: bpf-next 2020-10-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 90 non-merge commits during the last 8 day(s) which contain
a total of 103 files changed, 7662 insertions(+), 1894 deletions(-).

Note that once bpf(/net) tree gets merged into net-next, there will be a small
merge conflict in tools/lib/bpf/btf.c between commit 46dd7d1d0be1 ("libbpf: Fix
native endian assumption when parsing BTF") from the bpf tree and the commit
ae341790ea24 ("libbpf: Support BTF loading and raw data output in both endianness")
from the bpf-next tree. Correct resolution would be to stick with bpf-next, it
should look like:

  [...]
        /* check BTF magic */
        if (fread(&magic, 1, sizeof(magic), f) < sizeof(magic)) {
                err = -EIO;
                goto err_out;
        }
        if (magic != BTF_MAGIC && magic != bswap_16(BTF_MAGIC)) {
                /* definitely not a raw BTF */
                err = -EPROTO;
                goto err_out;
        }

        /* get file size */
  [...]

The main changes are:

1) Add bpf_snprintf_btf() and bpf_seq_printf_btf() helpers to support displaying
   BTF-based kernel data structures out of BPF programs, from Alan Maguire.

2) Speed up RCU tasks trace grace periods by a factor of 50 & fix a few race
   conditions exposed by it. It was discussed to take these via BPF and
   networking tree to get better testing exposure, from Paul E. McKenney.

3) Support multi-attach for freplace programs, needed for incremental attachment
   of multiple XDP progs using libxdp dispatcher model, from Toke Høiland-Jørgensen.

4) libbpf support for appending new BTF types at the end of BTF object, allowing
   intrusive changes of prog's BTF (useful for future linking), from Andrii Nakryiko.

5) Several BPF helper improvements e.g. avoid atomic op in cookie generator and add
   a redirect helper into neighboring subsys, from Daniel Borkmann.

6) Allow map updates on sockmaps from bpf_iter context in order to migrate sockmaps
   from one to another, from Lorenz Bauer.

7) Fix 32 bit to 64 bit assignment from latest alu32 bounds tracking which caused
   a verifier issue due to type downgrade to scalar, from John Fastabend.

8) Follow-up on tail-call support in BPF subprogs which optimizes x64 JIT prologue
   and epilogue sections, from Maciej Fijalkowski.

9) Add an option to perf RB map to improve sharing of event entries by avoiding remove-
   on-close behavior. Also, add BPF_PROG_TEST_RUN for raw_tracepoint, from Song Liu.

10) Fix a crash in AF_XDP's socket_release when memory allocation for UMEMs fails,
    from Magnus Karlsson.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'net-ravb-Add-support-for-explicit-internal-clock-delay-c
onfiguration'

Geert Uytterhoeven says:

====================
net/ravb: Add support for explicit internal clock delay configuration

Some Renesas EtherAVB variants support internal clock delay
configuration, which can add larger delays than the delays that are
typically supported by the PHY (using an "rgmii-*id" PHY mode, and/or
"[rt]xc-skew-ps" properties).

Historically, the EtherAVB driver configured these delays based on the
"rgmii-*id" PHY mode.  This caused issues with PHY drivers that
implement PHY internal delays properly[1].  Hence a backwards-compatible
workaround was added by masking the PHY mode[2].

This patch series implements the next step of the plan outlined in [3],
and adds proper support for explicit configuration of the MAC internal
clock delays using new "[rt]x-internal-delay-ps" properties.  If none of
these properties is present, the driver falls back to the old handling.

This can be considered the MAC counterpart of commit db44e3751ab7d089
("dt-bindings: net: Add tx and rx internal delays"), which applies to
the PHY.  Note that unlike commit 4e923ab21eb7b976 ("net: phy: Add a
helper to return the index for of the internal delay"), no helpers are
provided to parse the DT properties, as so far there is a single user
only, which supports only zero or a single fixed value.  Of course such
helpers can be added later, when the need arises, or when deemed useful
otherwise.

This series consists of 3 parts:
  1. DT binding updates documenting the new properties, for both the
     generic ethernet-controller and the EtherAVB-specific bindings,
  2. Conversion to json-schema of the Renesas EtherAVB DT bindings.
     Technically, the conversion is independent of all of the above.
     I included it in this series, as it shows how all sanity checks on
     "[rt]x-internal-delay-ps" values are implemented as DT binding
     checks,
  3. EtherAVB driver update implementing support for the new properties.

Given Rob has provided his acks for the DT binding updates, all of this
can be merged through net-next.

Changes compared to v3[4]:
  - Add Reviewed-by,
  - Drop the DT updates, as they will be merged through renesas-devel and
    arm-soc, and have a hard dependency on this series.

Changes compared to v2[5]:
  - Update recently added board DTS files,
  - Add Reviewed-by.

Changes compared to v1[6]:
  - Added "[PATCH 1/7] dt-bindings: net: ethernet-controller: Add
    internal delay properties",
  - Replace "renesas,[rt]xc-delay-ps" by "[rt]x-internal-delay-ps",
  - Incorporated EtherAVB DT binding conversion to json-schema,
  - Add Reviewed-by.

Impacted, tested:
  - Salvator-X(S) with R-Car H3 ES1.0 and ES2.0, M3-W, and M3-N.

Not impacted, tested:
  - Ebisu with R-Car E3.

Impacted, not tested:
  - Salvator-X(S) with other SoC variants,
  - ULCB with R-Car H3/M3-W/M3-N variants,
  - V3MSK and Eagle with R-Car V3M,
  - Draak with R-Car V3H,
  - HiHope RZ/G2[MN] with RZ/G2M or RZ/G2N,
  - Beacon EmbeddedWorks RZ/G2M Development Kit.

To ease testing, I have pushed this series and the DT updates to the
topic/ravb-internal-clock-delays-v4 branch of my renesas-drivers
repository at
git://git.kernel.org/pub/scm/linux/kernel/git/geert/renesas-drivers.git.

Thanks for applying!

References:
  [1] Commit 4570040da42c24b9 ("net: phy: micrel: add phy-mode support
      for the KSZ9031 PHY")
  [2] Commit c470071643c4a9a9 ("ravb: Mask PHY mode to avoid inserting
      delays twice").
      https://lore.kernel.org/r/20200529122540.31368-1-geert+renesas@glider.be/
  [3] https://lore.kernel.org/r/CAMuHMdU+MR-2tr3-pH55G0GqPG9HwH3XUd=8HZxprFDMGQeWUw@mail.gmail.com/
  [4] https://lore.kernel.org/linux-devicetree/20200819134344.27813-1-geert+renesas@glider.be/
  [5] https://lore.kernel.org/linux-devicetree/20200706143529.18306-1-geert+renesas@glider.be/
  [6] https://lore.kernel.org/linux-devicetree/20200619191554.24942-1-geert+renesas@glider.be/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Add support for explicit internal clock delay configuration

Some EtherAVB variants support internal clock delay configuration, which
can add larger delays than the delays that are typically supported by
the PHY (using an "rgmii-*id" PHY mode, and/or "[rt]xc-skew-ps"
properties).

Historically, the EtherAVB driver configured these delays based on the
"rgmii-*id" PHY mode.  This caused issues with PHY drivers that
implement PHY internal delays properly[1].  Hence a backwards-compatible
workaround was added by masking the PHY mode[2].

Add proper support for explicit configuration of the MAC internal clock
delays using the new "[rt]x-internal-delay-ps" properties.
Fall back to the old handling if none of these properties is present.

[1] Commit 4570040da42c24b9 ("net: phy: micrel: add phy-mode support for
    the KSZ9031 PHY")
[2] Commit c470071643c4a9a9 ("ravb: Mask PHY mode to avoid inserting
    delays twice").

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

ravb: Split delay handling in parsing and applying

Currently, full delay handling is done in both the probe and resume
paths. Split it in two parts, so the resume path doesn't have to redo
the parsing part over and over again.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: renesas,etheravb: Convert to json-schema

Convert the Renesas Ethernet AVB (EthernetAVB-IF) Device Tree binding
documentation to json-schema.

Add missing properties.
Update the example to match reality.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: renesas,ravb: Document internal clock delay properties

Some EtherAVB variants support internal clock delay configuration, which
can add larger delays than the delays that are typically supported by
the PHY (using an "rgmii-*id" PHY mode, and/or "[rt]xc-skew-ps"
properties).

Add properties for configuring the internal MAC delays.
These properties are mandatory, even when specified as zero, to
distinguish between old and new DTBs.

Update the (bogus) example accordingly.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Sergei Shtylyov <sergei.shtylyov@gmail.com>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

dt-bindings: net: ethernet-controller: Add internal delay properties

Internal Receive and Transmit Clock Delays are a common setting for
RGMII capable devices.

While these delays are typically applied by the PHY, some MACs support
configuring internal clock delay settings, too. Hence add standardized
properties to configure this.

This is the MAC counterpart of commit db44e3751ab7d089 ("dt-bindings:
net: Add tx and rx internal delays"), which applies to the PHY.

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2020-09-30' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-09-30

Updates and cleanups for mlx5 driver:

1) From Ariel, Dan Carpenter and Gostavo, Fixes to the previous
   mlx5 Connection track series.

2) From Yevgeny, trivial cleanups for Software steering

3) From Hamdan, Support for Flow source hint in software steering and
   E-Switch

4) From Parav and Sunil, Small and trivial E-Switch updates and
   cleanups in preparation for mlx5 Sub-functions support
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'introduce BPF_F_PRESERVE_ELEMS'

Song Liu says:

====================
This set introduces BPF_F_PRESERVE_ELEMS to perf event array for better
sharing of perf event. By default, perf event array removes the perf event
when the map fd used to add the event is closed. With BPF_F_PRESERVE_ELEMS
set, however, the perf event will stay in the array until it is removed, or
the map is closed.
---
Changes v3 => v5:
1. Clean up in selftest. (Alexei)

Changes v2 => v3:
1. Move perf_event_fd_array_map_free() to avoid unnecessary forward
declaration. (Daniel)

Changes v1 => v2:
1. Rename the flag as BPF_F_PRESERVE_ELEMS. (Alexei, Daniel)
2. Simplify the code and selftest. (Daniel, Alexei)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

selftests/bpf: Add tests for BPF_F_PRESERVE_ELEMS

Add tests for perf event array with and without BPF_F_PRESERVE_ELEMS.

Add a perf event to array via fd mfd. Without BPF_F_PRESERVE_ELEMS, the
perf event is removed when mfd is closed. With BPF_F_PRESERVE_ELEMS, the
perf event is removed when the map is freed.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200930224927.1936644-3-songliubraving@fb.com

bpf: Introduce BPF_F_PRESERVE_ELEMS for perf event array

Currently, perf event in perf event array is removed from the array when
the map fd used to add the event is closed. This behavior makes it
difficult to the share perf events with perf event array.

Introduce perf event map that keeps the perf event open with a new flag
BPF_F_PRESERVE_ELEMS. With this flag set, perf events in the array are not
removed when the original map fd is closed. Instead, the perf event will
stay in the map until 1) it is explicitly removed from the array; or 2)
the array is freed.

Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com

net/mlx5e: Fix potential null pointer dereference

Calls to kzalloc() and kvzalloc() should be null-checked
in order to avoid any potential failures. In this case,
a potential null pointer dereference.

Fix this by adding null checks for _parse_attr_ and _flow_
right after allocation.

Addresses-Coverity-ID: 1497154 ("Dereference before null check")
Fixes: a650fb5166d6 ("net/mlx5: Refactor tc flow attributes structure")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5e: Fix a use after free on error in mlx5_tc_ct_shared_counter_get()

This code frees "shared_counter" and then dereferences on the next line
to get the error code.

Fixes: da404640cb59 ("net/mlx5e: CT: Use the same counter for both directions")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Fix dereference on pointer attr after null check

When removing a flow from the slow path fdb, a flow attr struct is
allocated for the rule removal process. If the allocation fails the
code prints a warning message but continues with the removal flow
which include dereferencing a pointer which could be null.
Fix this by exiting the function in case the attr allocation failed.

Fixes: a650fb5166d6 ("net/mlx5: Refactor tc flow attributes structure")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ariel Levkovich <lariel@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: Use dma device access helper

Use the PCI device directly for dma accesses as non PCI device unlikely
support IOMMU and dma mappings.
Introduce and use helper routine to access DMA device.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: E-Switch, Support flow source for local vport

Set flow source as hint for local vport.

Signed-off-by: Hamdan Igbaria <hamdani@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: E-switch, Move devlink eswitch ports closer to eswitch

Currently devlink eswitch ports are registered and unregistered by the
representor layer.
However it is better to register them at eswitch layer so that in future
user initiated command port add and delete commands can also
register/unregister devlink ports without depending on representor layer.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: E-switch, Use helper function to load unload representor

To register and unregister devlink ports when loading/unload
representors, refactor the code to helper functions.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: E-switch, Add helper to check egress ACL need

Currently only VF vports need egress ACL table.
Add a generic helper to check whether a vport need egress
ACL table or not.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: E-switch, Use PF num in metadata reg c0

Currently only 256 vports can be supported as only 8 bits are
reserved for them and 8 bits are reserved for vhca_ids in
metadata reg c0. To support more than 256 vports, replace
vhca_id with a unique shorter 4-bit PF number which covers
upto 16 PF's. Use remaining 12 bits for vports ranging 1-4095.
This will continue to generate unique metadata even if
multiple PCI devices have same switch_id.

Signed-off-by: sunils <sunils@nvidia.com>
Reviewed-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Vu Pham <vuhuong@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Add support for rule creation with flow source hint

Skip the rule according to flow arrival source, in case of RX and the
source is local port skip and in case of TX and the source is uplink
skip, we get this info according to the flow source hint we get from
upper layers when creating the rule.
This is needed because for example in case of FDB table which has a TX
and RX tables and we are inserting a rule with an encap action which
is only a TX action, in this case rule will fail on RX, so we can rely
on the flow source hint and skip RX in such case.
Until now we relied on metadata regc_0 that upper layer mapped the
port in the regc_0, but the problem is that upper layer did not always
use regc_0 for port mapping, so now we added support to flow source
hint which upper layers will pass to SW steering when creating a rule.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Hamdan Igbaria <hamdani@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Call ste_builder directly with tag pointer

Instead of getting the tag in each function, call the builder
directly with the tag. This will allow to use the same function
for building the tag and the bitmask.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Remove unneeded local variable

The misc3 variable is used only once and can be dropped.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Remove unneeded vlan check from L2 builder

When we create a matcher we check that all fields are consumed.
There is no need for this specific check. This keeps the STE
builder functions simple and clean.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Remove unneeded check from source port builder

Mask validity for ste builders is checked by mlx5dr_ste_build_pre_check
during matcher creation.
It already checks the mask value of source_vport, so removing
this duplicated check.
Also, moving there the check of source_eswitch_owner_vhca_id mask.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

net/mlx5: DR, Replace the check for valid STE entry

Validity check is done by reading the next lu_type from the STE,
this check can be replaced by checking the refcount.
This will make the check independent on internal STE structure.

Signed-off-by: Alex Vesker <valex@nvidia.com>
Signed-off-by: Yevgeny Kliteynik <kliteyn@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

selftests/bpf: Fix alignment of .BTF_ids

Fix a build failure on arm64, due to missing alignment information for
the .BTF_ids section:

resolve_btfids.test.o: in function `test_resolve_btfids':
tools/testing/selftests/bpf/prog_tests/resolve_btfids.c:140:(.text+0x29c): relocation truncated to fit: R_AARCH64_LDST32_ABS_LO12_NC against `.BTF_ids'
ld: tools/testing/selftests/bpf/prog_tests/resolve_btfids.c:140: warning: one possible cause of this error is that the symbol is being referenced in the indicated code as if it had a larger alignment than was declared where it was defined

In vmlinux, the .BTF_ids section is aligned to 4 bytes by vmlinux.lds.h.
In test_progs however, .BTF_ids doesn't have alignment constraints. The
arm64 linker expects the btf_id_set.cnt symbol, a u32, to be naturally
aligned but finds it misaligned and cannot apply the relocation. Enforce
alignment of .BTF_ids to 4 bytes.

Fixes: 05462dbf47ed ("selftests/bpf: Add set test to resolve_btfids")
Signed-off-by: Jean-Philippe Brucker <jean-philippe@linaro.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jiri Olsa <jolsa@redhat.com>
Link: https://lore.kernel.org/bpf/20200930093559.2120126-1-jean-philippe@linaro.org

Merge branch 'drop_monitor-Convert-to-use-devlink-tracepoint'

Ido Schimmel says:

====================
drop_monitor: Convert to use devlink tracepoint

Drop monitor is able to monitor both software and hardware originated
drops. Software drops are monitored by having drop monitor register its
probe on the 'kfree_skb' tracepoint. Hardware originated drops are
monitored by having devlink call into drop monitor whenever it receives
a dropped packet from the underlying hardware.

This patch set converts drop monitor to monitor both software and
hardware originated drops in the same way - by registering its probe on
the relevant tracepoint.

In addition to drop monitor being more consistent, it is now also
possible to build drop monitor as module instead of as a builtin and
still monitor hardware originated drops. Initially, CONFIG_NET_DEVLINK
implied CONFIG_NET_DROP_MONITOR, but after commit 2f5a0f65c9f7
("kconfig: allow symbols implied by y to become m") we can have
CONFIG_NET_DEVLINK=y and CONFIG_NET_DROP_MONITOR=m and hardware
originated drops will not be monitored.

Patch set overview:

Patch #1 adds a tracepoint in devlink for trap reports.

Patch #2 prepares probe functions in drop monitor for the new
tracepoint.

Patch #3 converts drop monitor to use the new tracepoint.

Patches #4-#6 perform cleanups after the conversion.

Patch #7 adds a test case for drop monitor. Both software originated
drops and hardware originated drops (using netdevsim) are tested.

Tested:

| CONFIG_NET_DEVLINK | CONFIG_NET_DROP_MONITOR | Build | SW drops | HW drops |
| -------------------|-------------------------|-------|----------|----------|
|          y         |            y            |   v   |     v    |     v    |
|          y         |            m            |   v   |     v    |     v    |
|          y         |            n            |   v   |     x    |     x    |
|          n         |            y            |   v   |     v    |     x    |
|          n         |            m            |   v   |     v    |     x    |
|          n         |            n            |   v   |     x    |     x    |
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: net: Add drop monitor test

Test that drop monitor correctly captures both software and hardware
originated packet drops.

# ./drop_monitor_tests.sh

Software drops test
    TEST: Capturing active software drops                               [ OK ]
    TEST: Capturing inactive software drops                             [ OK ]

Hardware drops test
    TEST: Capturing active hardware drops                               [ OK ]
    TEST: Capturing inactive hardware drops                             [ OK ]

Tests passed:   4
Tests failed:   0

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Filter control packets in drop monitor

Previously, devlink called into drop monitor in order to report hardware
originated drops / exceptions. devlink intentionally filtered control
packets and did not pass them to drop monitor as they were not dropped
by the underlying hardware.

Now drop monitor registers its probe on a generic 'devlink_trap_report'
tracepoint and should therefore perform this filtering itself instead of
having devlink do that.

Add the trap type as metadata and have drop monitor ignore control
packets.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Remove duplicate struct

'struct net_dm_hw_metadata' is a duplicate of 'struct
devlink_trap_metadata'.

Remove the former and simplify the code.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Remove no longer used functions

The old probe functions that were invoked by drop monitor code are no
longer called and can thus be removed. They were replaced by actual
probe functions that are registered on the recently introduced
'devlink_trap_report' tracepoint.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Convert to using devlink tracepoint

Convert drop monitor to use the recently introduced
'devlink_trap_report' tracepoint instead of having devlink call into
drop monitor.

This is both consistent with software originated drops ('kfree_skb'
tracepoint) and also allows drop monitor to be built as a module and
still report hardware originated drops.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

drop_monitor: Prepare probe functions for devlink tracepoint

Drop monitor supports two alerting modes: Summary and packet. Prepare a
probe function for each, so that they could be later registered on the
devlink tracepoint by calling register_trace_devlink_trap_report(),
based on the configured alerting mode.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

devlink: Add a tracepoint for trap reports

Add a tracepoint for trap reports so that drop monitor could register
its probe on it. Use trace_devlink_trap_report_enabled() to avoid
wasting cycles setting the trap metadata if the tracepoint is not
enabled.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'linux-can-next-for-5.10-20200930' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2020-09-30

this is a pull request of 13 patches for net-next.

The first 10 target the mcp25xxfd driver (which is renamed to mcp251xfd during
this series).

The first two patches are by Thomas Kopp, which adds reference to the just
related errata and updates the documentation and log messages.

Dan Carpenter's patch fixes a resource leak during ifdown.

A patch by me adds the missing initialization of a variable.

Oleksij Rempel updates the DT binding documentation as requested by Rob
Herring.

The next 5 patches are by Thomas Kopp and me. During review Geert Uytterhoeven
suggested to use "microchip,mcp251xfd" instead of "microchip,mcp25xxfd" as the
DT autodetection compatible to avoid clashes with future but incompatible
devices. We decided not only to rename the compatible but the whole driver from
"mcp25xxfd" to "mcp251xfd". This is done in several patches.

Joakim Zhang contributes three patches for the flexcan driver. The first one
adds support for the ECC feature, which is implemented on some modern IP cores,
by initializing the controller's memory during ifup. The next patch adds
support for the i.MX8MP (which supports ECC) and the last patch properly
disables the runtime PM if device registration fails.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'ionic-watchdog-training'

Shannon Nelson says:

====================
ionic watchdog training

Our link watchdog displayed a couple of unfriendly behaviors in some recent
stress testing. These patches change the startup and stop timing in order
to be sure that expected structures are ready to be used by the watchdog.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: prevent early watchdog check

In one corner case scenario, the driver device lif setup can
get delayed such that the ionic_watchdog_cb() timer goes off
before the ionic->lif is set, thus causing a NULL pointer panic.
We catch the problem by checking for a NULL lif just a little
earlier in the callback.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

ionic: stop watchdog timer earlier on remove

We need to be better at making sure we don't have a link check
watchdog go off while we're shutting things down, so let's stop
the timer as soon as we start the remove.

Meanwhile, since that was the only thing in
ionic_dev_teardown(), simplify and remove that function.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'tcp-exponential-backoff-in-tcp_send_ack'

Eric Dumazet says:

====================
tcp: exponential backoff in tcp_send_ack()

We had outages caused by repeated skb allocation failures in tcp_send_ack()

It is time to add exponential backoff to reduce number of attempts.
Before doing so, first patch removes icsk_ack.blocked to make
room for a new field (icsk_ack.retry)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

tcp: add exponential backoff in __tcp_send_ack()

Whenever host is under very high memory pressure,
__tcp_send_ack() skb allocation fails, and we setup
a 200 ms (TCP_DELACK_MAX) timer before retrying.

On hosts with high number of TCP sockets, we can spend
considerable amount of cpu cycles in these attempts,
add high pressure on various spinlocks in mm-layer,
ultimately blocking threads attempting to free space
from making any progress.

This patch adds standard exponential backoff to avoid
adding fuel to the fire.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

inet: remove icsk_ack.blocked

TCP has been using it to work around the possibility of tcp_delack_timer()
finding the socket owned by user.

After commit f02b41be228f ("tcp: improve latencies of timer triggered events")
we added TCP_DELACK_TIMER_DEFERRED atomic bit for more immediate recovery,
so we can get rid of icsk_ack.blocked

This frees space that following patch will reuse.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: macb: move pdata to private header

struct macb_platform_data is only used by macb_pci to register the platform
device, move its definition to cadence/macb.h and remove platform_data/macb.h

Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge branch 'mlxsw-PFC-and-headroom-selftests'

Petr Machata says:

====================
mlxsw: PFC and headroom selftests

Recent changes in the headroom management code made it clear that an
automated way of testing this functionality is needed. This patchset brings
two tests: a synthetic headroom behavior test, which verifies mechanics of
headroom management. And a PFC test, which verifies whether this behavior
actually translates into a working lossless configuration.

Both of these tests rely on mlnx_qos[1], a tool that interfaces with Linux
DCB API. The tool was originally written to work with Mellanox NICs, but
does not actually rely on anything Mellanox-specific, and can be used for
mlxsw as well as for any other NIC-like driver. Unlike Open LLDP it does
support buffer commands and permits a fire-and-forget approach to
configuration, which makes it very handy for writing of selftests.

Patches #1-#3 extend the selftest devlink_lib.sh in various ways. Patch #4
then adds a helper wrapper for mlnx_qos to mlxsw's qos_lib.sh.

Patch #5 adds a test for management of port headroom.

Patch #6 adds a PFC test.

[1] https://github.com/Mellanox/mlnx-tools/
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Add a PFC test

Add a test for PFC. Runs 10MB of traffic through a bottleneck and checks
that none of it gets lost.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: Add headroom handling test

Add a test for headroom configuration. This covers projection of ETS
configuration to ingress, PFC, adjustments for MTU, the qdisc / TC
mode and the effect of egress SPAN session on buffer configuration.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mlxsw: qos_lib: Add a wrapper for running mlnx_qos

mlnx_qos is a script for configuration of DCB. Despite the name it is not
actually Mellanox-specific in any way. It is currently the only ad-hoc tool
available (in contrast to a daemon that manages an interface on an ongoing
basis). However, it is very verbose and parsing out error messages is not
really possible. Add a wrapper that makes it easier to use the tool.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: devlink_lib: Support port-less topologies

Some selftests may not need any actual ports. Technically those are not
forwarding selftests, but devlink_lib can still be handy. Fall back on
NETIF_NO_CABLE in those cases.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: devlink_lib: Add devlink_cell_size_get()

Add a helper that answers the cell size of the devlink device.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: forwarding: devlink_lib: Split devlink_..._set() into save & set

Changing pool type from static to dynamic causes reinterpretation of
threshold values. They therefore need to be saved before pool type is
changed, then the pool type can be changed, and then the new values need
to be set up.

For that reason, set cannot subsume save, because it would be saving the
wrong thing, with possibly a nonsensical value, and restore would then fail
to restore the nonsensical value.

Thus extract a _save() from each of the relevant _set()'s. This way it is
possible to save everything up front, then to tweak it, and then restore in
the required order.

Signed-off-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

can: flexcan: disable runtime PM if register flexcandev failed

Disable runtime PM if register flexcandev failed, and balance reference
of usage_count.

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Link: https://lore.kernel.org/r/20200929211557.14153-4-qiangqing.zhang@nxp.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: add flexcan driver for i.MX8MP

Add flexcan driver for i.MX8MP, which supports CAN FD and ECC.

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Link: https://lore.kernel.org/r/20200929211557.14153-3-qiangqing.zhang@nxp.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: initialize all flexcan memory for ECC function

One issue was reported at a baremetal environment, which is used for
FPGA verification. "The first transfer will fail for extended ID
format(for both 2.0B and FD format), following frames can be transmitted
and received successfully for extended format, and standard format don't
have this issue. This issue occurred randomly with high possiblity, when
it occurs, the transmitter will detect a BIT1 error, the receiver a CRC
error. According to the spec, a non-correctable error may cause this
transfer failure."

With FLEXCAN_QUIRK_DISABLE_MECR quirk, it supports correctable errors,
disable non-correctable errors interrupt and freeze mode. Platform has
ECC hardware support, but select this quirk, this issue may not come to
light. Initialize all FlexCAN memory before accessing them, at least it
can avoid non-correctable errors detected due to memory uninitialized.
The internal region can't be initialized when the hardware doesn't support
ECC.

According to IMX8MPRM, Rev.C, 04/2020. There is a NOTE at the section
11.8.3.13 Detection and correction of memory errors:
"All FlexCAN memory must be initialized before starting its operation in
order to have the parity bits in memory properly updated. CTRL2[WRMFRZ]
grants write access to all memory positions that require initialization,
ranging from 0x080 to 0xADF and from 0xF28 to 0xFFF when the CAN FD feature
is enabled. The RXMGMASK, RX14MASK, RX15MASK, and RXFGMASK registers need to
be initialized as well. MCR[RFEN] must not be set during memory initialization."

Memory range from 0x080 to 0xADF, there are reserved memory (unimplemented
by hardware, e.g. only configure 64 MBs), these memory can be initialized or not.
In this patch, initialize all flexcan memory which includes reserved memory.

In this patch, create FLEXCAN_QUIRK_SUPPORT_ECC for platforms which has ECC
feature. If you have a ECC platform in your hand, please select this
qurik to initialize all flexcan memory firstly, then you can select
FLEXCAN_QUIRK_DISABLE_MECR to only enable correctable errors.

Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Link: https://lore.kernel.org/r/20200929211557.14153-2-qiangqing.zhang@nxp.com
[mkl: wrap long lines]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: rename all remaining occurrence to mcp251xfd

In [1] Geert noted that the autodetect compatible for the mcp25xxfd driver,
which is "microchip,mcp25xxfd" might be too generic and overlap with upcoming,
but incompatible chips.

In the previous patch the autodetect DT compatbile has been renamed to
"microchip,mcp251xfd", this patch changes all non user facing occurrence of
"mcp25xxfd" to "mcp251xfd" and "MCP25XXFD" to "MCP251XFD".

[1] http://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com

Link: https://lore.kernel.org/r/20200930091424.792165-10-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: rename all user facing strings to mcp251xfd

In [1] Geert noted that the autodetect compatible for the mcp25xxfd driver,
which is "microchip,mcp25xxfd" might be too generic and overlap with upcoming,
but incompatible chips.

In the previous patch the autodetect DT compatbile has been renamed to
"microchip,mcp251xfd", this patch changes all user facing strings from
"mcp25xxfd" to "mcp251xfd" and "MCP25XXFD" to "MCP251XFD", including:
- kconfig symbols
- name of kernel module
- DT and SPI compatible

[1] http://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com

Link: https://lore.kernel.org/r/20200930091424.792165-9-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: rename driver files and subdir to mcp251xfd

In [1] Geert noted that the autodetect compatible for the mcp25xxfd driver,
which is "microchip,mcp25xxfd" might be too generic and overlap with upcoming,
but incompatible chips.

In the previous patch the autodetect DT compatbile has been renamed to
"microchip,mcp251xfd", this patch changes the name of the driver subdir and the
individual files accordinly.

[1] http://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com

Link: https://lore.kernel.org/r/20200930091424.792165-8-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

selftests/bpf: Test "incremental" btf_dump in C format

Add test validating that btf_dump works fine with BTFs that are modified and
incrementally generated.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200929232843.1249318-5-andriin@fb.com

libbpf: Make btf_dump work with modifiable BTF

Ensure that btf_dump can accommodate new BTF types being appended to BTF
instance after struct btf_dump was created. This came up during attemp to
use btf_dump for raw type dumping in selftests, but given changes are not
excessive, it's good to not have any gotchas in API usage, so I decided to
support such use case in general.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200929232843.1249318-2-andriin@fb.com

Merge branch 'Various BPF helper improvements'

Daniel Borkmann says:

====================
This series adds two BPF helpers, that is, one for retrieving the classid
of an skb and another one to redirect via the neigh subsystem, and improves
also the cookie helpers by removing the atomic counter. I've also added
the bpf_tail_call_static() helper to the libbpf API that we've been using
in Cilium for a while now, and last but not least the series adds a few
selftests. For details, please check individual patches, thanks!

v3 -> v4:
  - Removed out_rec error path (Martin)
  - Integrate BPF_F_NEIGH flag into rejecting invalid flags (Martin)
    - I think this way it's better to avoid bit overlaps given it's
      right in the place that would need to be extended on new flags
v2 -> v3:
  - Removed double skb->dev = dev assignment (David)
  - Added headroom check for v6 path (David)
  - Set set flowi4_proto for ip_route_output_flow (David)
  - Rebased onto latest bpf-next
v1 -> v2:
  - Rework cookie generator to support nested contexts (Eric)
  - Use ip_neigh_gw6() and container_of() (David)
  - Rename __throw_build_bug() and improve comments (Andrii)
  - Use bpf_tail_call_static() also in BPF samples (Maciej)
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf, selftests: Add redirect_neigh selftest

Add a small test that exercises the new redirect_neigh() helper for the
IPv4 and IPv6 case.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/0fc7d9c5f9a6cc1c65b0d3be83b44b1ec9889f43.1601477936.git.daniel@iogearbox.net

bpf, selftests: Use bpf_tail_call_static where appropriate

For those locations where we use an immediate tail call map index use the
newly added bpf_tail_call_static() helper.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/3cfb2b799a62d22c6e7ae5897c23940bdcc24cbc.1601477936.git.daniel@iogearbox.net

bpf, libbpf: Add bpf_tail_call_static helper for bpf programs

Port of tail_call_static() helper function from Cilium's BPF code base [0]
to libbpf, so others can easily consume it as well. We've been using this
in production code for some time now. The main idea is that we guarantee
that the kernel's BPF infrastructure and JIT (here: x86_64) can patch the
JITed BPF insns with direct jumps instead of having to fall back to using
expensive retpolines. By using inline asm, we guarantee that the compiler
won't merge the call from different paths with potentially different
content of r2/r3.

We're also using Cilium's __throw_build_bug() macro (here as: __bpf_unreachable())
in different places as a neat trick to trigger compilation errors when
compiler does not remove code at compilation time. This works for the BPF
back end as it does not implement the __builtin_trap().

[0] https://github.com/cilium/cilium/commit/f5537c26020d5297b70936c6b7d03a1e412a1035

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/1656a082e077552eb46642d513b4a6bde9a7dd01.1601477936.git.daniel@iogearbox.net

bpf: Add redirect_neigh helper as redirect drop-in

Add a redirect_neigh() helper as redirect() drop-in replacement
for the xmit side. Main idea for the helper is to be very similar
in semantics to the latter just that the skb gets injected into
the neighboring subsystem in order to let the stack do the work
it knows best anyway to populate the L2 addresses of the packet
and then hand over to dev_queue_xmit() as redirect() does.

This solves two bigger items: i) skbs don't need to go up to the
stack on the host facing veth ingress side for traffic egressing
the container to achieve the same for populating L2 which also
has the huge advantage that ii) the skb->sk won't get orphaned in
ip_rcv_core() when entering the IP routing layer on the host stack.

Given that skb->sk neither gets orphaned when crossing the netns
as per ad414b1000c3 ("skbuff: preserve sock reference when scrubbing
the skb.") the helper can then push the skbs directly to the phys
device where FQ scheduler can do its work and TCP stack gets proper
backpressure given we hold on to skb->sk as long as skb is still
residing in queues.

With the helper used in BPF data path to then push the skb to the
phys device, I observed a stable/consistent TCP_STREAM improvement
on veth devices for traffic going container -> host -> host ->
container from ~10Gbps to ~15Gbps for a single stream in my test
environment.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@kernel.org>
Link: https://lore.kernel.org/bpf/f207de81629e1724899b73b8112e0013be782d35.1601477936.git.daniel@iogearbox.net

bpf, net: Rework cookie generator as per-cpu one

With its use in BPF, the cookie generator can be called very frequently
in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
and attached to the root cgroup, for example, when used in v1/v2 mixed
environments. In particular, when there's a high churn on sockets in the
system there can be many parallel requests to the bpf_get_socket_cookie()
and bpf_get_netns_cookie() helpers which then cause contention on the
atomic counter.

As similarly done in f68ef4d382fe ("fs: introduce a per-cpu last_ino
allocator"), add a small helper library that both can use for the 64 bit
counters. Given this can be called from different contexts, we also need
to deal with potential nested calls even though in practice they are
considered extremely rare. One idea as suggested by Eric Dumazet was
to use a reverse counter for this situation since we don't expect 64 bit
overflows anyways; that way, we can avoid bigger gaps in the 64 bit
counter space compared to just batch-wise increase. Even on machines
with small number of cores (e.g. 4) the cookie generation shrinks from
min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
in parallel from multiple CPUs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net

bpf: Add classid helper only based on skb->sk

Similarly to e88ef305d9d6 ("bpf: Allow to retrieve cgroup v1 classid
from v2 hooks"), add a helper to retrieve cgroup v1 classid solely
based on the skb->sk, so it can be used as key as part of BPF map
lookups out of tc from host ns, in particular given the skb->sk is
retained these days when crossing net ns thanks to ad414b1000c3
("skbuff: preserve sock reference when scrubbing the skb."). This
is similar to bpf_skb_cgroup_id() which implements the same for v2.
Kubernetes ecosystem is still operating on v1 however, hence net_cls
needs to be used there until this can be dropped in with the v2
helper of bpf_skb_cgroup_id().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/ed633cf27a1c620e901c5aa99ebdefb028dce600.1601477936.git.daniel@iogearbox.net

bpf: fix raw_tp test run in preempt kernel

In preempt kernel, BPF_PROG_TEST_RUN on raw_tp triggers:

[   35.874974] BUG: using smp_processor_id() in preemptible [00000000]
code: new_name/87
[   35.893983] caller is bpf_prog_test_run_raw_tp+0xd4/0x1b0
[   35.900124] CPU: 1 PID: 87 Comm: new_name Not tainted 5.9.0-rc6-g615bd02bf #1
[   35.907358] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.10.2-1ubuntu1 04/01/2014
[   35.916941] Call Trace:
[   35.919660]  dump_stack+0x77/0x9b
[   35.923273]  check_preemption_disabled+0xb4/0xc0
[   35.928376]  bpf_prog_test_run_raw_tp+0xd4/0x1b0
[   35.933872]  ? selinux_bpf+0xd/0x70
[   35.937532]  __do_sys_bpf+0x6bb/0x21e0
[   35.941570]  ? find_held_lock+0x2d/0x90
[   35.945687]  ? vfs_write+0x150/0x220
[   35.949586]  do_syscall_64+0x2d/0x40
[   35.953443]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by calling migrate_disable() before smp_processor_id().

Fixes: ad35d2e77169 ("bpf: Enable BPF_PROG_TEST_RUN for raw_tracepoint")
Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

can: mcp25xxfd: narrow down wildcards in device tree bindings to "microchip,mcp251xfd"

The wildcard should be narrowed down to prevent existing and future devices
that are not compatible from matching. It is very unlikely that incompatible
devices will be released that do not match the wildcard.

Discussion Reference: https://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com>
Link: https://lore.kernel.org/r/20200930091423.755-1-thomas.kopp@microchip.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

dt-binding: can: mcp251xfd: narrow down wildcards in device tree bindings to "microchip,mcp251xfd"

The wildcard should be narrowed down to prevent existing and future devices
that are not compatible from matching. It is very unlikely that incompatible
devices will be released that do not match the wildcard.

This is the documentation part of the commit.

Discussion Reference: https://lore.kernel.org/r/CAMuHMdVkwGjr6dJuMyhQNqFoJqbh6Ec5V2b5LenCshwpM2SDsQ@mail.gmail.com

Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com>
Link: https://lore.kernel.org/r/20200930091423.755-2-thomas.kopp@microchip.com
[mkl: rename file, too]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

dt-binding: can: mcp25xxfd: documentation fixes

Apply following fixes:
- Use 'interrupts'. (interrupts-extended will automagically be supported
by the tools)
- *-supply is always a single item. So, drop maxItems=1
- add "additionalProperties: false" flag to detect unneeded properties.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200923125301.27200-1-o.rempel@pengutronix.de
Reported-by: Rob Herring <robh@kernel.org>
Reviewed-by: Rob Herring <robh@kernel.org>
Fixes: 2dffde966ad8 ("dt-binding: can: mcp25xxfd: document device tree bindings")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp25xxfd: mcp25xxfd_irq(): add missing initialization of variable set_normal mode

This patch fixes the following warning:

drivers/net/can/spi/mcp25xxfd/mcp25xxfd-core.c:2155 mcp25xxfd_irq()
error: uninitialized symbol 'set_normal_mode'.

by adding the missing initialization.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Fixes: 74ed7c9bd838 ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN")
Link: https://lore.kernel.org/r/20200923114726.2704426-1-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp25xxfd: mcp25xxfd_ring_free(): fix memory leak during cleanup

This loop doesn't free the first element of the array. The "i > 0" has
to be changed to "i >= 0".

Fixes: 74ed7c9bd838 ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20200923112752.GA1473821@mwanda
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp25xxfd: mcp25xxfd_probe(): add SPI clk limit related errata information

This patch adds a reference to the recent released MCP2517FD and MCP2518FD
errata sheets and paste the explanation.

The driver already implements the proposed fix.

Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com>
Link: https://lore.kernel.org/r/20200925065606.358-1-thomas.kopp@microchip.com
[mkl: split into two patches, adjust subject and commit message]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp25xxfd: mcp25xxfd_handle_eccif(): add ECC related errata and update log messages

This patch adds a reference to the recent released MCP2517FD and MCP2518FD
errata sheets and paste the explanation.

The single error correction does not always work, so always indicate that a
single error occurred. If the location of the ECC error is outside of the
TX-RAM always use netdev_notice() to log the problem. For ECC errors in the
TX-RAM, there is a recovery procedure.

Signed-off-by: Thomas Kopp <thomas.kopp@microchip.com>
Link: https://lore.kernel.org/r/20200925065606.358-1-thomas.kopp@microchip.com
[mkl: split into two patches, adjust subject and commit message]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge branch 'HW-support-for-VCAP-IS1-and-ES0-in-mscc_ocelot'

Vladimir Oltean says:

====================
HW support for VCAP IS1 and ES0 in mscc_ocelot

The patches from RFC series "Offload tc-flower to mscc_ocelot switch
using VCAP chains" have been split into 2:
https://patchwork.ozlabs.org/project/netdev/list/?series=204810&state=*

This is the boring part, that deals with the prerequisites, and not with
tc-flower integration. Apart from the initialization of some hardware
blocks, which at this point still don't do anything, no new
functionality is introduced.

- Key and action field offsets are defined for the supported switches.
- VCAP properties are added to the driver for the new TCAM blocks. But
  instead of adding them manually as was done for IS2, which is error
  prone, the driver is refactored to read these parameters from
  hardware, which is possible.
- Some improvements regarding the processing of struct ocelot_vcap_filter.
- Extending the code to be compatible with full and quarter keys.

This series was tested, along with other patches not yet submitted, on
the Felix and Seville switches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: look up the filters in flower_stats() and flower_destroy()

Currently a new filter is created, containing just enough correct
information to be able to call ocelot_vcap_block_find_filter_by_index()
on it.

This will be limiting us in the future, when we'll have more metadata
associated with a filter, which will matter in the stats() and destroy()
callbacks, and which we can't make up on the spot. For example, we'll
start "offloading" some dummy tc filter entries for the TCAM skeleton,
but we won't actually be adding them to the hardware, or to block->rules.
So, it makes sense to avoid deleting those rules too. That's the kind of
thing which is difficult to determine unless we look up the real filter.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: add a new ocelot_vcap_block_find_filter_by_id function

And rename the existing find to ocelot_vcap_block_find_filter_by_index.
The index is the position in the TCAM, and the id is the flow cookie
given by tc.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: rename variable 'cnt' in vcap_data_offset_get()

The 'cnt' variable is actually used for 2 purposes, to hold the number
of sub-words per VCAP entry, and the number of sub-words per VCAP
action.

In fact, I'm pretty sure these 2 numbers can never be different from one
another. By hardware definition, the entry (key) TCAM rows are divided
into the same number of sub-words as its associated action RAM rows.
But nonetheless, let's at least rename the variables such that
observations like this one are easier to make in the future.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: rename variable 'count' in vcap_data_offset_get()

This gets rid of one of the 2 variables named, very generically,
"count".

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: calculate vcap offsets correctly for full and quarter entries

When calculating the offsets for the current entry within the row and
placing them inside struct vcap_data, the function assumes half key
entry (2 keys per row).

This patch modifies the vcap_data_offset_get() function to calculate a
correct data offset when the setting VCAP Type-Group of a key to
VCAP_TG_FULL or VCAP_TG_QUARTER.

This is needed because, for example, VCAP ES0 only supports full keys.

Also rename the 'count' variable to 'num_entries_per_row' to make the
function just one tiny bit easier to follow.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: parse flower action before key

When we'll make the switch to multiple chain offloading, we'll want to
know first what VCAP block the rule is offloaded to. This impacts what
keys are available. Since the VCAP block is determined by what actions
are used, parse the action first.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: remove unneeded VCAP parameters for IS2

Now that we are deriving these from the constants exposed by the
hardware, we can delete the static info we're keeping in the driver.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: automatically detect VCAP constants

The numbers in struct vcap_props are not intuitive to derive, because
they are not a straightforward copy-and-paste from the reference manual
but instead rely on a fairly detailed level of understanding of the
layout of an entry in the TCAM and in the action RAM. For this reason,
bugs are very easy to introduce here.

Ease the work of hardware porters and read from hardware the constants
that were exported for this particular purpose. Note that this implies
that struct vcap_props can no longer be const.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: add definitions for VCAP ES0 keys, actions and target

As a preparation step for the offloading to ES0, let's create the
infrastructure for talking with this hardware block.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: add definitions for VCAP IS1 keys, actions and target

As a preparation step for the offloading to IS1, let's create the
infrastructure for talking with this hardware block.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: generalize existing code for VCAP

In the Ocelot switches there are 3 TCAMs: VCAP ES0, IS1 and IS2, which
have the same configuration interface, but different sets of keys and
actions. The driver currently only supports VCAP IS2.

In preparation of VCAP IS1 and ES0 support, the existing code must be
generalized to work with any VCAP.

In that direction, we should move the structures that depend upon VCAP
instantiation, like vcap_is2_keys and vcap_is2_actions, out of struct
ocelot and into struct vcap_props .keys and .actions, a structure that
is replicated 3 times, once per VCAP. We'll pass that structure as an
argument to each function that does the key and action packing - only
the control logic needs to distinguish between ocelot->vcap[VCAP_IS2]
or IS1 or ES0.

Another change is to make use of the newly introduced ocelot_target_read
and ocelot_target_write API, since the 3 VCAPs have the same registers
but put at different addresses.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: return error if VCAP filter is not found

Although it doesn't look like it is possible to hit these conditions
from user space, there are 2 separate, but related, issues.

First, the ocelot_vcap_block_get_filter_index function, née
ocelot_ace_rule_get_index_id prior to the 5e7620b6f9b6 ("net: mscc:
ocelot: generalize the "ACE/ACL" names") rename, does not do what the
author probably intended. If the desired filter entry is not present in
the ACL block, this function returns an index equal to the total number
of filters, instead of -1, which is maybe what was intended, judging
from the curious initialization with -1, and the "++index" idioms.
Either way, none of the callers seems to expect this behavior.

Second issue, the callers don't actually check the return value at all.
So in case the filter is not found in the rule list, propagate the
return code.

So update the callers and also take the opportunity to get rid of the
odd coding idioms that appear to work but don't.

Signed-off-by: Xiaoliang Yang <xiaoliang.yang_1@nxp.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mscc: ocelot: introduce a new ocelot_target_{read,write} API

There are some targets (register blocks) in the Ocelot switch that are
instantiated more than once. For example, the VCAP IS1, IS2 and ES0
blocks all share the same register layout for interacting with the cache
for the TCAM and the action RAM.

For the VCAPs, the procedure for servicing them is actually common. We
just need an API specifying which VCAP we are talking to, and we do that
via these raw ocelot_target_read and ocelot_target_write accessors.

In plain ocelot_read, the target is encoded into the register enum
itself:

u16 target = reg >> TARGET_OFFSET;

For the VCAPs, the registers are currently defined like this:

enum ocelot_reg {
[...]
S2_CORE_UPDATE_CTRL = S2 << TARGET_OFFSET,
S2_CORE_MV_CFG,
S2_CACHE_ENTRY_DAT,
S2_CACHE_MASK_DAT,
S2_CACHE_ACTION_DAT,
S2_CACHE_CNT_DAT,
S2_CACHE_TG_DAT,
[...]
};

which is precisely what we want to avoid, because we'd have to duplicate
the same register map for S1 and for S0, and then figure out how to pass
VCAP instance-specific registers to the ocelot_read calls (basically
another lookup table that undoes the effect of shifting with
TARGET_OFFSET).

So for some targets, propose a more raw API, similar to what is
currently done with ocelot_port_readl and ocelot_port_writel. Those
targets can only be accessed with ocelot_target_{read,write} and not
with ocelot_{read,write} after the conversion, which is fine.

The VCAP registers are not actually modified to use this new API as of
this patch. They will be modified in the next one.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

net: mvneta: avoid possible cache misses in mvneta_rx_swbm

Do not use rx_desc pointers if possible since rx descriptors are stored in
uncached memory and dereferencing rx_desc pointers generate extra loads.
This patch improves XDP_DROP performance of ~ 110Kpps (700Kpps vs 590Kpps)
on Marvell Espressobin

Analyzed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

libbpf: Compile in PIC mode only for shared library case

Libbpf compiles .o's for static and shared library modes separately, so no
need to specify -fPIC for both. Keep it only for shared library mode.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200929220604.833631-3-andriin@fb.com

libbpf: Compile libbpf under -O2 level by default and catch extra warnings

For some reason compiler doesn't complain about uninitialized variable, fixed
in previous patch, if libbpf is compiled without -O2 optimization level. So do
compile it with -O2 and never let similar issue slip by again. -Wall is added
unconditionally, so no need to specify it again.

Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200929220604.833631-2-andriin@fb.com

libbpf: Fix uninitialized variable in btf_parse_type_sec

Fix obvious unitialized variable use that wasn't reported by compiler. libbpf
Makefile changes to catch such errors are added separately.

Fixes: ae341790ea24 ("libbpf: Support BTF loading and raw data output in both endianness")
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200929220604.833631-1-andriin@fb.com

Merge branch 'bpf, x64: optimize JIT's pro/epilogue'

Maciej Fijalkowski says:

====================
Hi!

This small set can be considered as a followup after recent addition of
support for tailcalls in bpf subprograms and is focused on optimizing
x64 JIT prologue and epilogue sections.

Turns out the popping tail call counter is not needed anymore and %rsp
handling when stack depth is 0 can be skipped.

For longer explanations, please see commit messages.

Thank you,
Maciej
====================

Signed-off-by: Alexei Starovoitov <ast@kernel.org>

bpf: x64: Do not emit sub/add 0, %rsp when !stack_depth

There is no particular reason for keeping the "sub 0, %rsp" insn within
the BPF's x64 JIT prologue.

When tail call code was skipping the whole prologue section these 7
bytes that represent the rsp subtraction could not be simply discarded
as the jump target address would be broken. An option to address that
would be to substitute it with nop7.

Right now tail call is skipping only first 11 bytes of target program's
prologue and "sub X, %rsp" is the first insn that is processed, so if
stack depth is zero then this insn could be omitted without the need for
nop7 swap.

Therefore, do not emit the "sub 0, %rsp" in prologue when program is not
making use of R10 register. Also, make the emission of "add X, %rsp"
conditional in tail call code logic and take into account the presence
of mentioned insn when calculating the jump offsets.

Signed-off-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200929204653.4325-3-maciej.fijalkowski@intel.com