git.baikalelectronics.ru Git

Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next

Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next. This
includes one patch to update ovs and act_ct to use nf_ct_put() instead
of nf_conntrack_put().

1) Add netns_tracker to nfnetlink_log and masquerade, from Eric Dumazet.

2) Remove redundant rcu read-size lock in nf_tables packet path.

3) Replace BUG() by WARN_ON_ONCE() in nft_payload.

4) Consolidate rule verdict tracing.

5) Replace WARN_ON() by WARN_ON_ONCE() in nf_tables core.

6) Make counter support built-in in nf_tables.

7) Add new field to conntrack object to identify locally generated
   traffic, from Florian Westphal.

8) Prevent NAT from shadowing well-known ports, from Florian Westphal.

9) Merge nf_flow_table_{ipv4,ipv6} into nf_flow_table_inet, also from
   Florian.

10) Remove redundant pointer in nft_pipapo AVX2 support, from Colin Ian King.

11) Replace opencoded max() in conntrack, from Jiapeng Chong.

12) Update conntrack to use refcount_t API, from Florian Westphal.

13) Move ip_ct_attach indirection into the nf_ct_hook structure.

14) Constify several pointer object in the netfilter codebase,
    from Florian Westphal.

15) Tree-wide replacement of nf_conntrack_put() by nf_ct_put(), also
    from Florian.

16) Fix egress splat due to incorrect rcu notation, from Florian.

17) Move stateful fields of connlimit, last, quota, numgen and limit
    out of the expression data area.

18) Build a blob to represent the ruleset in nf_tables, this is a
    requirement of the new register tracking infrastructure.

19) Add NFT_REG32_NUM to define the maximum number of 32-bit registers.

20) Add register tracking infrastructure to skip redundant
    store-to-register operations, this includes support for payload,
    meta and bitwise expresssions.

* git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next: (32 commits)
  netfilter: nft_meta: cancel register tracking after meta update
  netfilter: nft_payload: cancel register tracking after payload update
  netfilter: nft_bitwise: track register operations
  netfilter: nft_meta: track register operations
  netfilter: nft_payload: track register operations
  netfilter: nf_tables: add register tracking infrastructure
  netfilter: nf_tables: add NFT_REG32_NUM
  netfilter: nf_tables: add rule blob layout
  netfilter: nft_limit: move stateful fields out of expression data
  netfilter: nft_limit: rename stateful structure
  netfilter: nft_numgen: move stateful fields out of expression data
  netfilter: nft_quota: move stateful fields out of expression data
  netfilter: nft_last: move stateful fields out of expression data
  netfilter: nft_connlimit: move stateful fields out of expression data
  netfilter: egress: avoid a lockdep splat
  net: prefer nf_ct_put instead of nf_conntrack_put
  netfilter: conntrack: avoid useless indirection during conntrack destruction
  netfilter: make function op structures const
  netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook
  netfilter: conntrack: convert to refcount_t api
  ...
====================

Link: https://lore.kernel.org/r/20220109231640.104123-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

netfilter: nft_meta: cancel register tracking after meta update

The meta expression might mangle the packet metadata, cancel register
tracking since any metadata in the registers is stale.

Finer grain register tracking cancellation by inspecting the meta type
on the register is also possible.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_payload: cancel register tracking after payload update

The payload expression might mangle the packet, cancel register tracking
since any payload data in the registers is stale.

Finer grain register tracking cancellation by inspecting the payload
base, offset and length on the register is also possible.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_bitwise: track register operations

Check if the destination register already contains the data that this
bitwise expression performs. This allows to skip this redundant
operation.

If the destination contains a different bitwise operation, cancel the
register tracking information. If the destination contains no bitwise
operation, update the register tracking information.

Update the payload and meta expression to check if this bitwise
operation has been already performed on the register. Hence, both the
payload/meta and the bitwise expressions are reduced.

There is also a special case: If source register != destination register
and source register is not updated by a previous bitwise operation, then
transfer selector from the source register to the destination register.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_meta: track register operations

Check if the destination register already contains the data that this
meta store expression performs. This allows to skip this redundant
operation. If the destination contains a different selector, update
the register tracking information.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_payload: track register operations

Check if the destination register already contains the data that this
payload store expression performs. This allows to skip this redundant
operation. If the destination contains a different selector, update
the register tracking information.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add register tracking infrastructure

This patch adds new infrastructure to skip redundant selector store
operations on the same register to achieve a performance boost from
the packet path.

This is particularly noticeable in pure linear rulesets but it also
helps in rulesets which are already heaving relying in maps to avoid
ruleset linear inspection.

The idea is to keep data of the most recurrent store operations on
register to reuse them with cmp and lookup expressions.

This infrastructure allows for dynamic ruleset updates since the ruleset
blob reduction happens from the kernel.

Userspace still needs to be updated to maximize register utilization to
cooperate to improve register data reuse / reduce number of store on
register operations.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add NFT_REG32_NUM

Add a definition including the maximum number of 32-bits registers that
are used a scratchpad memory area to store data.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nf_tables: add rule blob layout

This patch adds a blob layout per chain to represent the ruleset in the
packet datapath.

size (unsigned long)
struct nft_rule_dp
  struct nft_expr
  ...
        struct nft_rule_dp
          struct nft_expr
          ...
        struct nft_rule_dp (is_last=1)

The new structure nft_rule_dp represents the rule in a more compact way
(smaller memory footprint) compared to the control-plane nft_rule
structure.

The ruleset blob is a read-only data structure. The first field contains
the blob size, then the rules containing expressions. There is a trailing
rule which is used by the tracing infrastructure which is equivalent to
the NULL rule marker in the previous representation. The blob size field
does not include the size of this trailing rule marker.

The ruleset blob is generated from the commit path.

This patch reuses the infrastructure available since 82560e53d11e
("netfilter: nf_tables: remove synchronize_rcu in commit phase") to
build the array of rules per chain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_limit: move stateful fields out of expression data

In preparation for the rule blob representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_limit: rename stateful structure

From struct nft_limit to nft_limit_priv.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_numgen: move stateful fields out of expression data

In preparation for the rule blob representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_quota: move stateful fields out of expression data

In preparation for the rule blob representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_last: move stateful fields out of expression data

In preparation for the rule blob representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: nft_connlimit: move stateful fields out of expression data

In preparation for the rule blob representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: egress: avoid a lockdep splat

include/linux/netfilter_netdev.h:97 suspicious rcu_dereference_check() usage!
2 locks held by sd-resolve/1100:
0: ..(rcu_read_lock_bh){1:3}, at: ip_finish_output2
1: ..(rcu_read_lock_bh){1:3}, at: __dev_queue_xmit
__dev_queue_xmit+0 ..

The helper has two callers, one uses rcu_read_lock, the other
rcu_read_lock_bh(). Annotate the dereference to reflect this.

Fixes: 0c0f505da7c89 ("netfilter: Introduce egress hook")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

net: prefer nf_ct_put instead of nf_conntrack_put

Its the same as nf_conntrack_put(), but without the
need for an indirect call. The downside is a module dependency on
nf_conntrack, but all of these already depend on conntrack anyway.

Cc: Paul Blakey <paulb@mellanox.com>
Cc: dev@openvswitch.org
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: avoid useless indirection during conntrack destruction

nf_ct_put() results in a usesless indirection:

nf_ct_put -> nf_conntrack_put -> nf_conntrack_destroy -> rcu readlock +
indirect call of ct_hooks->destroy().

There are two _put helpers:
nf_ct_put and nf_conntrack_put. The latter is what should be used in
code that MUST NOT cause a linker dependency on the conntrack module
(e.g. calls from core network stack).

Everyone else should call nf_ct_put() instead.

A followup patch will convert a few nf_conntrack_put() calls to
nf_ct_put(), in particular from modules that already have a conntrack
dependency such as act_ct or even nf_conntrack itself.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: make function op structures const

No functional changes, these structures should be const.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: core: move ip_ct_attach indirection to struct nf_ct_hook

ip_ct_attach predates struct nf_ct_hook, we can place it there and
remove the exported symbol.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: convert to refcount_t api

Convert nf_conn reference counting from atomic_t to refcount_t based api.
refcount_t api provides more runtime sanity checks and will warn on
certain constructs, e.g. refcount_inc() on a zero reference count, which
usually indicates use-after-free.

For this reason template allocation is changed to init the refcount to
1, the subsequenct add operations are removed.

Likewise, init_conntrack() is changed to set the initial refcount to 1
instead refcount_inc().

This is safe because the new entry is not (yet) visible to other cpus.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

netfilter: conntrack: Use max() instead of doing it manually

Fix following coccicheck warning:

./include/net/netfilter/nf_conntrack.h:282:16-17: WARNING opportunity
for max().

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>

Merge tag 'for-net-next-2022-01-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next

Luiz Augusto von Dentz says:

====================
bluetooth-next pull request for net-next:

- Add support for Foxconn QCA 0xe0d0
- Fix HCI init sequence on MacBook Air 8,1 and 8,2
- Fix Intel firmware loading on legacy ROM devices

* tag 'for-net-next-2022-01-07' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next:
  Bluetooth: hci_sock: fix endian bug in hci_sock_setsockopt()
  Bluetooth: L2CAP: uninitialized variables in l2cap_sock_setsockopt()
  Bluetooth: btqca: sequential validation
  Bluetooth: btusb: Add support for Foxconn QCA 0xe0d0
  Bluetooth: btintel: Fix broken LED quirk for legacy ROM devices
  Bluetooth: hci_event: Rework hci_inquiry_result_with_rssi_evt
  Bluetooth: btbcm: disable read tx power for MacBook Air 8,1 and 8,2
  Bluetooth: hci_qca: Fix NULL vs IS_ERR_OR_NULL check in qca_serdev_probe
  Bluetooth: hci_bcm: Check for error irq
====================

Link: https://lore.kernel.org/r/20220107210942.3750887-1-luiz.dentz@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge tag 'linux-can-next-for-5.17-20220108' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2022-01-08

this is a pull request of 22 patches for net-next/master.

The first patch is by Tom Rix and fixes an uninitialized variable in
the janz-ican3 driver (introduced in linux-can-next-for-5.17-20220105).

The next 13 patches are by my and target the mcp251xfd driver. First
several cleanup patches, then the driver is prepared for the upcoming
ethtool ring parameter and IRQ coalescing support, which is added in a
later pull request.

The remaining 8 patches are by Dario Binacchi and me and enhance the
flexcan driver. The driver is moved into a sub directory. An ethtool
private flag is added to optionally disable CAN RTR frame reception,
to make use of more RX buffers. The resulting RX buffer configuration
can be read by ethtool ring parameter support. Finally documentation
for the ethtool private flag is added to the
Documentation/networking/device_drivers/can directory.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

docs: networking: device drivers: can: add flexcan

Add initial documentation for Flexcan driver.

Link: https://lore.kernel.org/all/20220107193105.1699523-8-mkl@pengutronix.de
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

docs: networking: device drivers: add can sub-folder

Add the container for CAN drivers documentation.

Link: https://lore.kernel.org/all/20220107193105.1699523-7-mkl@pengutronix.de
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: add ethtool support to get rx/tx ring parameters

This patch adds ethtool support to get the number of message buffers
configured for reception/transmission, which may also depends on
runtime configurations such as the 'rx-rtr' flag state.

Link: https://lore.kernel.org/all/20220108181633.420433-1-dario.binacchi@amarulasolutions.com
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
[mkl: port to net-next/master, replace __sw_hweight64 by simpler calculation]
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: add ethtool support to change rx-rtr setting during runtime

This patch adds a private flag to the flexcan driver to switch the
"rx-rtr" setting on and off.

"rx-rtr" on  - Receive RTR frames. (default)
               The CAN controller can and will receive RTR frames.

       On some IP cores the controller cannot receive RTR
       frames in the more performant "RX mailbox" mode and
       will use "RX FIFO" mode instead.

"rx-rtr" off - Waive ability to receive RTR frames. (not supported on all IP cores)
               This mode activates the "RX mailbox mode" for better
       performance, on some IP cores RTR frames cannot be
       received anymore.

The "RX FIFO" mode uses a FIFO with a depth of 6 CAN frames.
The "RX mailbox" mode uses up to 62 mailboxes.

Link: https://lore.kernel.org/all/20220107193105.1699523-6-mkl@pengutronix.de
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Co-developed-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: add more quirks to describe RX path capabilities

Most flexcan IP cores support 2 RX modes:
- FIFO
- mailbox

Some IP core versions cannot receive CAN RTR messages via mailboxes.
This patch adds quirks to document this.

This information will be used in a later patch to switch from FIFO to
more performant mailbox mode at the expense of losing the ability to
receive RTR messages. This trade off is beneficial in certain use
cases.

Link: https://lore.kernel.org/all/20220107193105.1699523-5-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: rename RX modes

Most flexcan IP cores support 2 RX modes:
- FIFO
- mailbox

The names for these modes were chosen to reflect the name of the
rx-offload mode they are using.

The name of the RX modes should better reflect their difference with
regards the flexcan IP core. So this patch renames the various
occurrences of OFF_FIFO to RX_FIFO and OFF_TIMESTAMP to RX_MAILBOX:

| FLEXCAN_TX_MB_RESERVED_OFF_FIFO -> FLEXCAN_TX_MB_RESERVED_RX_FIFO
| FLEXCAN_TX_MB_RESERVED_OFF_TIMESTAMP -> FLEXCAN_TX_MB_RESERVED_RX_MAILBOX
| FLEXCAN_QUIRK_USE_OFF_TIMESTAMP -> FLEXCAN_QUIRK_USE_RX_MAILBOX

Link: https://lore.kernel.org/all/20220107193105.1699523-4-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: allow to change quirks at runtime

This is a preparation patch for the upcoming support to change the
rx-rtr capability via the ethtool API.

Link: https://lore.kernel.org/all/20220107193105.1699523-3-mkl@pengutronix.de
Signed-off-by: Dario Binacchi <dario.binacchi@amarulasolutions.com>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: flexcan: move driver into separate sub directory

This patch moves the flexcan driver into a separate directory, a later
patch will add more files.

Link: https://lore.kernel.org/all/20220107193105.1699523-2-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: introduce and make use of mcp251xfd_is_fd_mode()

This patch replaces the open coded check, if the chip's FIFOs are
configured for CAN-FD mode, by the newly introduced function
mcp251xfd_is_fd_mode().

Link: https://lore.kernel.org/all/20220105154300.1258636-14-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: move ring init into separate function

This patch moves the ring initialization from the mcp251xfd core file
into a separate one to make the driver a bit more orderly.

Link: https://lore.kernel.org/all/20220105154300.1258636-13-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: move chip FIFO init into separate file

This patch moves the chip FIFO initialization from the mcp251xfd core
file into a separate one to make the driver a bit more orderly.

Link: https://lore.kernel.org/all/20220105154300.1258636-12-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: move TEF handling into separate file

This patch moves the TEF handling from the mcp251xfd core file into a
separate one to make the driver a bit more orderly.

Link: https://lore.kernel.org/all/20220105154300.1258636-11-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: move TX handling into separate file

This patch moves the TX handling from the mcp251xfd core file into a
separate one to make the driver a bit more orderly.

Link: https://lore.kernel.org/all/20220105154300.1258636-10-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: move RX handling into separate file

This patch moves the RX handling from the mcp251xfd core file into a
separate one to make the driver a bit more orderly.

Link: https://lore.kernel.org/all/20220105154300.1258636-9-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: mcp251xfd.h: sort function prototypes

The .c files in the Makefile are ordered alphabetically. This patch
groups the function prototypes by their corresponding .c file and
brings the into the same order.

Link: https://lore.kernel.org/all/20220105154300.1258636-8-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: mcp251xfd_handle_rxovif(): denote RX overflow message to debug + add rate limiting

A RX overflow usually happens during high system load. Printing
overflow messages to the kernel log, which on embedded systems often
is outputted on the serial console, even increases the system load.

To decrease the system load in these situations, denote the messages
to debug level and wrap them with net_ratelimit().

Link: https://lore.kernel.org/all/20220105154300.1258636-7-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: mcp251xfd_open(): make use of pm_runtime_resume_and_get()

With patch

| c4a28fe77ffc PM: runtime: Add pm_runtime_resume_and_get to deal with usage counter

the usual pm_runtime_get_sync() and pm_runtime_put_noidle()
in-case-of-error dance is no longer needed. Convert the mcp251xfd
driver to use this function.

Link: https://lore.kernel.org/all/20220105154300.1258636-6-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: mcp251xfd_open(): open_candev() first

This patch exchanges the order of open_candev() and
pm_runtime_get_sync(), so that open_candev() is called first.

A usual reason why open_candev() fails is missing CAN bit rate
configuration. It makes no sense to resume the device from PM sleep
first just to put it to sleep if the bit rate is not configured.

Link: https://lore.kernel.org/all/20220105154300.1258636-5-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: add missing newline to printed strings

This patch adds the missing newline to printed strings.

Fixes: 9afc4ad0aed9 ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN")
Link: https://lore.kernel.org/all/20220105154300.1258636-4-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: mcp251xfd_tef_obj_read(): fix typo in error message

This patch fixes a typo in the error message in
mcp251xfd_tef_obj_read(), if trying to read too many objects.

Link: https://lore.kernel.org/all/20220105154300.1258636-3-mkl@pengutronix.de
Fixes: 9afc4ad0aed9 ("can: mcp25xxfd: add driver for Microchip MCP25xxFD SPI CAN")
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: mcp251xfd: remove double blank lines

This patch removes double blank lines from the driver.

Link: https://lore.kernel.org/all/20220105154300.1258636-2-mkl@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

can: janz-ican3: initialize dlc variable

Clang static analysis reports this problem
janz-ican3.c:1311:2: warning: Undefined or garbage value returned to caller
        return dlc;
        ^~~~~~~~~~

dlc is only set with this conditional
if (!(cf->can_id & CAN_RTR_FLAG))
dlc = cf->len;

But is always returned.  So initialize dlc to 0.

Fixes: b01d5543f78d ("can: do not increase tx_bytes statistics for RTR frames")
Link: https://lore.kernel.org/all/20220108143319.3986923-1-trix@redhat.com
Signed-off-by: Tom Rix <trix@redhat.com>
Acked-by: Vincent Mailhol <mailhol.vincent@wanadoo.fr>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>

Merge branch 'ena-capabilities-field-and-cosmetic-changes'

Arthur Kiyanovski says:

====================
ENA: capabilities field and cosmetic changes

Add a new capabilities bitmask field to get indication of
capabilities supported by the device. Use the capabilities
field to query the device for ENI stats support.

Other patches are cosmetic changes like fixing readme
mistakes, removing unused variables etc...
====================

Link: https://lore.kernel.org/r/20220107202346.3522-1-akiyano@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Extract recurring driver reset code into a function

Create an inline function for resetting the driver
to reduce code duplication.

Signed-off-by: Nati Koler <nkoler@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Change the name of bad_csum variable

Changed bad_csum to csum_bad to align with csum_unchecked & csum_good

Signed-off-by: Nati Koler <nkoler@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Add debug prints for invalid req_id resets

Add qid and req_id to error prints when ENA_REGS_RESET_INV_TX_REQ_ID
reset occurs.

Switch from %hu to %u, since u16 should be printed with %u, as
explained in [1].

[1] - https://www.kernel.org/doc/html/latest/core-api/printk-formats.html

Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Remove ena_calc_queue_size_ctx struct

This struct was used to pass data from callee function to its caller.
Its usage can be avoided.

Removing it results in less code without any damage to code readability.
Also it allows to consolidate ring size calculation into a single
function (ena_calc_io_queue_size()).

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Move reset completion print to the reset function

The print that indicates that device reset has finished is
currently called from ena_restore_device().

Move it to ena_fw_reset_device() as it is the more natural
location for it.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Remove redundant return code check

The ena_com_indirect_table_fill_entry() function only returns -EINVAL
or 0, no need to check for -EOPNOTSUPP.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Update LLQ header length in ena documentation

LLQ entry length is 128 bytes. Therefore the maximum header in
the entry is calculated by:
tx_max_header_size =
LLQ_ENTRY_SIZE - DESCRIPTORS_NUM_BEFORE_HEADER * 16 =
128 - 2 * 16 = 96

This patch updates the documentation so that it states the correct
max header length.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Change ENI stats support check to use capabilities field

Use the capabilities field to query the device for ENI stats
support.

This replaces the previous method that tried to get the ENI stats
during ena_probe() and used the success or failure as an indication
for support by the device.

Remove eni_stats_supported field from struct ena_adapter. This field
was used for the previous method of queriying for ENI stats support.

Change the severity level of the print in case of
ena_com_get_eni_stats() failure from info to error.
With the previous method of querying form ENI stats support, failure
to get ENI stats was normal for devices that don't support it.
With the use of the capabilities field such a failure is unexpected,
as it is called only if the device reported that it supports ENI
stats.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Add capabilities field with support for ENI stats capability

This bitmask field indicates what capabilities are supported by the
device.

The capabilities field differs from the 'supported_features' field which
indicates what sub-commands for the set/get feature commands are
supported. The sub-commands are specified in the 'feature_id' field of
the 'ena_admin_set_feat_cmd' struct in the following way:

        struct ena_admin_set_feat_cmd cmd;

        cmd.aq_common_descriptor.opcode = ENA_ADMIN_SET_FEATURE;
        cmd.feat_common.feature_

The 'capabilities' field, on the other hand, specifies different
capabilities of the device. For example, whether the device supports
querying of ENI stats.

Also add an enumerator which contains all the capabilities. The
first added capability macro is for ENI stats feature.

Capabilities are queried along with the other device attributes (in
ena_com_get_dev_attr_feat()) during device initialization and are stored
in the ena_com_dev struct. They can be later queried using the
ena_com_get_cap() helper function.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: ena: Change return value of ena_calc_io_queue_size() to void

ena_calc_io_queue_size() always returns 0, therefore make it a
void function and update the calling function to stop checking
the return value.

Signed-off-by: Shay Agroskin <shayagr@amazon.com>
Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

af_packet: fix tracking issues in packet_do_bind()

It appears that my changes in packet_do_bind() were
slightly wrong.

syzbot found that calling bind() twice would trigger
a false positive.

Remove proto_curr/dev_curr variables and rewrite things
to be less confusing (like not having to use netdev_tracker_alloc(),
and instead use the standard dev_hold_track())

Fixes: 5f1b96644de5 ("net: add net device refcount tracker to struct packet_type")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Link: https://lore.kernel.org/r/20220107183953.3886647-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-refactoring-for-one-selftest-and-csum-validation'

Mat Martineau says:

====================
mptcp: Refactoring for one selftest and csum validation

Patch 1 changes the MPTCP join self tests to depend more on events
rather than delays, so the script runs faster and has more consistent
results.

Patches 2 and 3 get rid of some duplicate code in MPTCP's checksum
validation by modifying and leveraging an existing helper function.
====================

Link: https://lore.kernel.org/r/20220107192524.445137-1-mathew.j.martineau@linux.intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: reuse __mptcp_make_csum in validate_data_csum

This patch reused __mptcp_make_csum() in validate_data_csum() instead of
open-coding.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mptcp: change the parameter of __mptcp_make_csum

This patch changed the type of the last parameter of __mptcp_make_csum()
from __sum16 to __wsum. And export this function in protocol.h.

Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

selftests: mptcp: more stable join tests-cases

MPTCP join self-tests are a bit fragile as they reply on
delays instead of events to catch-up with the expected
sockets states.

Replace the delay with state checking where possible and
reduce the number of sleeps in the most complex scenarios.

This will both reduce the tests run-time and will improve
stability.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: dsa: felix: add port fast age support

Add support for flushing the MAC table on a given port in the ocelot
switch library, and use this functionality in the felix DSA driver.

This operation is needed when a port leaves a bridge to become
standalone, and when the learning is disabled, and when the STP state
changes to a state where no FDB entry should be present.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Link: https://lore.kernel.org/r/20220107144229.244584-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net: mscc: ocelot: fix incorrect balancing with down LAG ports

Assuming the test setup described here:
https://patchwork.kernel.org/project/netdevbpf/cover/20210205130240.4072854-1-vladimir.oltean@nxp.com/
(swp1 and swp2 are in bond0, and bond0 is in a bridge with swp0)

it can be seen that when swp1 goes down (on either board A or B), then
traffic that should go through that port isn't forwarded anywhere.

A dump of the PGID table shows the following:

PGID_DST[0] = ports 0
PGID_DST[1] = ports 1
PGID_DST[2] = ports 2
PGID_DST[3] = ports 3
PGID_DST[4] = ports 4
PGID_DST[5] = ports 5
PGID_DST[6] = no ports
PGID_AGGR[0] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[1] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[2] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[3] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[4] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[5] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[6] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[7] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[8] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[9] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[10] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[11] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[12] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[13] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[14] = ports 0, 1, 2, 3, 4, 5
PGID_AGGR[15] = ports 0, 1, 2, 3, 4, 5
PGID_SRC[0] = ports 1, 2
PGID_SRC[1] = ports 0
PGID_SRC[2] = ports 0
PGID_SRC[3] = no ports
PGID_SRC[4] = no ports
PGID_SRC[5] = no ports
PGID_SRC[6] = ports 0, 1, 2, 3, 4, 5

Whereas a "good" PGID configuration for that setup should have looked
like this:

PGID_DST[0] = ports 0
PGID_DST[1] = ports 1, 2
PGID_DST[2] = ports 1, 2
PGID_DST[3] = ports 3
PGID_DST[4] = ports 4
PGID_DST[5] = ports 5
PGID_DST[6] = no ports
PGID_AGGR[0] = ports 0, 2, 3, 4, 5
PGID_AGGR[1] = ports 0, 2, 3, 4, 5
PGID_AGGR[2] = ports 0, 2, 3, 4, 5
PGID_AGGR[3] = ports 0, 2, 3, 4, 5
PGID_AGGR[4] = ports 0, 2, 3, 4, 5
PGID_AGGR[5] = ports 0, 2, 3, 4, 5
PGID_AGGR[6] = ports 0, 2, 3, 4, 5
PGID_AGGR[7] = ports 0, 2, 3, 4, 5
PGID_AGGR[8] = ports 0, 2, 3, 4, 5
PGID_AGGR[9] = ports 0, 2, 3, 4, 5
PGID_AGGR[10] = ports 0, 2, 3, 4, 5
PGID_AGGR[11] = ports 0, 2, 3, 4, 5
PGID_AGGR[12] = ports 0, 2, 3, 4, 5
PGID_AGGR[13] = ports 0, 2, 3, 4, 5
PGID_AGGR[14] = ports 0, 2, 3, 4, 5
PGID_AGGR[15] = ports 0, 2, 3, 4, 5
PGID_SRC[0] = ports 1, 2
PGID_SRC[1] = ports 0
PGID_SRC[2] = ports 0
PGID_SRC[3] = no ports
PGID_SRC[4] = no ports
PGID_SRC[5] = no ports
PGID_SRC[6] = ports 0, 1, 2, 3, 4, 5

In other words, in the "bad" configuration, the attempt is to remove the
inactive swp1 from the destination ports via PGID_DST. But when a MAC
table entry is learned, it is learned towards PGID_DST 1, because that
is the logical port id of the LAG itself (it is equal to the lowest
numbered member port). So when swp1 becomes inactive, if we set
PGID_DST[1] to contain just swp1 and not swp2, the packet will not have
any chance to reach the destination via swp2.

The "correct" way to remove swp1 as a destination is via PGID_AGGR
(remove swp1 from the aggregation port groups for all aggregation
codes). This means that PGID_DST[1] and PGID_DST[2] must still contain
both swp1 and swp2. This makes the MAC table still treat packets
destined towards the single-port LAG as "multicast", and the inactive
ports are removed via the aggregation code tables.

The change presented here is a design one: the ocelot_get_bond_mask()
function used to take an "only_active_ports" argument. We don't need
that. The only call site that specifies only_active_ports=true,
ocelot_set_aggr_pgids(), must retrieve the entire bonding mask, because
it must program that into PGID_DST. Additionally, it must also clear the
inactive ports from the bond mask here, which it can't do if bond_mask
just contains the active ports:

ac = ocelot_read_rix(ocelot, ANA_PGID_PGID, i);
ac &= ~bond_mask; <---- here
/* Don't do division by zero if there was no active
* port. Just make all aggregation codes zero.
*/
if (num_active_ports)
ac |= BIT(aggr_idx[i % num_active_ports]);
ocelot_write_rix(ocelot, ac, ANA_PGID_PGID, i);

So it becomes the responsibility of ocelot_set_aggr_pgids() to take
ocelot_port->lag_tx_active into consideration when populating the
aggr_idx array.

Fixes: 20111d10360d ("net: mscc: ocelot: rebalance LAGs on link up/down events")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Link: https://lore.kernel.org/r/20220107164332.402133-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
40GbE Intel Wired LAN Driver Updates 2022-01-07

This series contains updates to i40e and iavf drivers.

Karen limits per VF MAC filters so that one VF does not consume all
filters for i40e.

Jedrzej reduces busy wait time for admin queue calls for i40e.

Mateusz updates firmware versions to reflect new supported NVM images
and renames an error to remove non-inclusive language for i40e.

Yang Li fixes a set but not used warning for i40e.

Jason Wang removes an unneeded variable for iavf.

* '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  iavf: remove an unneeded variable
  i40e: remove variables set but not used
  i40e: Remove non-inclusive language
  i40e: Update FW API version
  i40e: Minimize amount of busy-waiting during AQ send
  i40e: Add ensurance of MacVlan resources for every trusted VF
====================

Link: https://lore.kernel.org/r/20220107175704.438387-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

net/tls: Fix skb memory leak when running kTLS traffic

The cited Fixes commit introduced a memory leak when running kTLS
traffic (with/without hardware offloads).
I'm running nginx on the server side and wrk on the client side and get
the following:

  unreferenced object 0xffff8881935e9b80 (size 224):
  comm "softirq", pid 0, jiffies 4294903611 (age 43.204s)
  hex dump (first 32 bytes):
    80 9b d0 36 81 88 ff ff 00 00 00 00 00 00 00 00  ...6............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000efe2a999>] build_skb+0x1f/0x170
    [<00000000ef521785>] mlx5e_skb_from_cqe_mpwrq_linear+0x2bc/0x610 [mlx5_core]
    [<00000000945d0ffe>] mlx5e_handle_rx_cqe_mpwrq+0x264/0x9e0 [mlx5_core]
    [<00000000cb675b06>] mlx5e_poll_rx_cq+0x3ad/0x17a0 [mlx5_core]
    [<0000000018aac6a9>] mlx5e_napi_poll+0x28c/0x1b60 [mlx5_core]
    [<000000001f3369d1>] __napi_poll+0x9f/0x560
    [<00000000cfa11f72>] net_rx_action+0x357/0xa60
    [<000000008653b8d7>] __do_softirq+0x282/0x94e
    [<00000000644923c6>] __irq_exit_rcu+0x11f/0x170
    [<00000000d4085f8f>] irq_exit_rcu+0xa/0x20
    [<00000000d412fef4>] common_interrupt+0x7d/0xa0
    [<00000000bfb0cebc>] asm_common_interrupt+0x1e/0x40
    [<00000000d80d0890>] default_idle+0x53/0x70
    [<00000000f2b9780e>] default_idle_call+0x8c/0xd0
    [<00000000c7659e15>] do_idle+0x394/0x450

I'm not familiar with these areas of the code, but I've added this
sk_defer_free_flush() to tls_sw_recvmsg() based on a hunch and it
resolved the issue.

Fixes: 4ae7ff61863f ("tcp: defer skb freeing after socket lock is released")
Signed-off-by: Gal Pressman <gal@nvidia.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20220102081253.9123-1-gal@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

iavf: remove an unneeded variable

The variable `ret_code' used for returning is never changed in function
`iavf_shutdown_adminq'. So that it can be removed and just return its
initial value 0 at the end of `iavf_shutdown_adminq' function.

Signed-off-by: Jason Wang <wangborong@cdjrlc.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: remove variables set but not used

The code that uses variables pe_cntx_size and pe_filt_size
has been removed, so they should be removed as well.

Eliminate the following clang warnings:
drivers/net/ethernet/intel/i40e/i40e_common.c:4139:20:
warning: variable 'pe_filt_size' set but not used.
drivers/net/ethernet/intel/i40e/i40e_common.c:4139:6:
warning: variable 'pe_cntx_size' set but not used.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Remove non-inclusive language

Remove non-inclusive language from the driver.

Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Update FW API version

Update FW API versions to the newest supported NVM images.

Signed-off-by: Sylwester Dziedziuch <sylwesterx.dziedziuch@intel.com>
Signed-off-by: Mateusz Palczewski <mateusz.palczewski@intel.com>
Tested-by: Gurucharan G <gurucharanx.g@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Minimize amount of busy-waiting during AQ send

The i40e_asq_send_command will now use a non blocking usleep_range if
possible (non-atomic context), instead of busy-waiting udelay. The
usleep_range function uses hrtimers to provide better performance and
removes the negative impact of busy-waiting in time-critical
environments.

1. Rename i40e_asq_send_command to i40e_asq_send_command_atomic
   and add 5th parameter to inform if called from an atomic context.
   Call inside usleep_range (if non-atomic) or udelay (if atomic).

2. Change i40e_asq_send_command to invoke
   i40e_asq_send_command_atomic(..., false).

3. Change two functions:
    - i40e_aq_set_vsi_uc_promisc_on_vlan
    - i40e_aq_set_vsi_mc_promisc_on_vlan
   to explicitly use i40e_asq_send_command_atomic(..., true)
   instead of i40e_asq_send_command, as they use spinlocks and do some
   work in an atomic context.
   All other calls to i40e_asq_send_command remain unchanged.

Signed-off-by: Dawid Lukwinski <dawid.lukwinski@intel.com>
Signed-off-by: Jedrzej Jagielski <jedrzej.jagielski@intel.com>
Tested-by: Tony Brelinski <tony.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

i40e: Add ensurance of MacVlan resources for every trusted VF

Trusted VF can use up every resource available, leaving nothing
to other trusted VFs.
Introduce define, which calculates MacVlan resources available based
on maximum available MacVlan resources, bare minimum for each VF and
number of currently allocated VFs.

Signed-off-by: Przemyslaw Patynowski <przemyslawx.patynowski@intel.com>
Signed-off-by: Karen Sornek <karen.sornek@intel.com>
Tested-by: Tony Brelinski <tony.brelinski@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>

sch_cake: revise Diffserv docs

Documentation incorrectly stated that CS1 is equivalent to LE for
diffserv8. But when LE was added to the table, CS1 was pushed into tin
1, leaving only LE in tin 0.

Also "TOS1" no longer exists, as that is the same codepoint as LE.

Make other tweaks properly distinguishing codepoints from classes and
putting current Diffserve codepoints ahead of legacy ones.

Signed-off-by: Kevin Bracey <kevin@bracey.fi>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Link: https://lore.kernel.org/r/20220106215637.3132391-1-kevin@bracey.fi
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mptcp-next'

Mat Martineau says:

====================
mptcp: New features and cleanup

These patches have been tested in the MPTCP tree for a longer than usual
time (thanks to holiday schedules), and are ready for the net-next
branch. Changes include feature updates, small fixes, refactoring, and
some selftest changes.

Patch 1 fixes an OUTQ ioctl issue with TCP fallback sockets.

Patches 2, 3, and 6 add support of the MPTCP fastclose option (quick
shutdown of the full MPTCP connection, similar to TCP RST in regular
TCP), and a related self test.

Patch 4 cleans up some accept and poll code that is no longer needed
after the fastclose changes.

Patch 5 add userspace disconnect using AF_UNSPEC, which is used when
testing fastclose and makes the MPTCP socket's handling of AF_UNSPEC in
connect() more TCP-like.

Patches 7-11 refactor subflow creation to make better use of multiple
local endpoints and to better handle individual connection failures when
creating multiple subflows. Includes self test updates.

Patch 12 cleans up the way subflows are added to the MPTCP connection
list, eliminating the need for calls throughout the MPTCP code that had
to check the intermediate "join list" for entries to shift over to the
main "connection list".

Patch 13 refactors the MPTCP release_cb flags to use separate storage
for values only accessed with the socket lock held (no atomic ops
needed), and for values that need atomic operations.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: avoid atomic bit manipulation when possible

Currently the msk->flags bitmask carries both state for the
mptcp_release_cb() - mostly touched under the mptcp data lock
- and others state info touched even outside such lock scope.

As a consequence, msk->flags is always manipulated with
atomic operations.

This change splits such bitmask in two separate fields, so
that we use plain bit operations when touching the
cb-related info.

The MPTCP_PUSH_PENDING bit needs additional care, as it is the
only CB related field currently accessed either under the mptcp
data lock or the mptcp socket lock.
Let's add another mask just for such bit's sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: cleanup MPJ subflow list handling

We can simplify the join list handling leveraging the
mptcp_release_cb(): if we can acquire the msk socket
lock at mptcp_finish_join time, move the new subflow
directly into the conn_list, otherwise place it on join_list and
let the release_cb process such list.

Since pending MPJ connection are now always processed
in a timely way, we can avoid flushing the join list
every time we have to process all the current subflows.

Additionally we can now use the mptcp data lock to protect
the join_list, removing the additional spin lock.

Finally, the MPJ handshake is now always finalized under the
msk socket lock, we can drop the additional synchronization
between mptcp_finish_join() and mptcp_close().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add tests for subflow creation failure

Verify that, when multiple endpoints are available, subflows
creation proceed even when the first additional subflow creation
fails - due to packet drop on the relevant link

Co-developed-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: do not block subflows creation on errors

If the MPTCP configuration allows for multiple subflows
creation, and the first additional subflows never reach
the fully established status - e.g. due to packets drop or
reset - the in kernel path manager do not move to the
next subflow.

This patch introduces a new PM helper to cope with MPJ
subflow creation failure and delay and hook it where appropriate.

Such helper triggers additional subflow creation, as needed
and updates the PM subflow counter, if the current one is
closing.

Additionally start all the needed additional subflows
as soon as the MPTCP socket is fully established, so we don't
have to cope with slow MPJ handshake blocking the next subflow
creation.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: keep track of local endpoint still available for each msk

Include into the path manager status a bitmap tracking the list
of local endpoints still available - not yet used - for the
relevant mptcp socket.

Keep such map updated at endpoint creation/deletion time, so
that we can easily skip already used endpoint at local address
selection time.

The endpoint used by the initial subflow is lazyly accounted at
subflow creation time: the usage bitmap is be up2date before
endpoint selection and we avoid such unneeded task in some relevant
scenarios - e.g. busy servers accepting incoming subflows but
not creating any additional ones nor annuncing additional addresses.

Overall this allows for fair local endpoints usage in case of
subflow failure.

As a side effect, this patch also enforces that each endpoint
is used at most once for each mptcp connection.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: clean-up MPJ option writing

Check for all MPJ variant at once, this reduces the number
of conditionals traversed on average and will simplify the
next patch.

No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: fix per socket endpoint accounting

Since full-mesh endpoint support, the reception of a single ADD_ADDR
option can cause multiple subflows creation. When such option is
accepted we increment 'add_addr_accepted' by one. When we received
a paired RM_ADDR option, we deleted all the relevant subflows,
decrementing 'add_addr_accepted' by one for each of them.

We have a similar issue for 'local_addr_used'

Fix them moving the pm endpoint accounting outside the subflow
traversal.

Fixes: eadd062c09df ("mptcp: local addresses fullmesh")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

selftests: mptcp: add disconnect tests

Performs several disconnect/reconnect on the same socket,
ensuring the overall transfer is succesful.

The new test leverages ioctl(SIOCOUTQ) to ensure all the
pending data is acked before disconnecting.

Additionally order alphabetically the test program arguments list
for better maintainability.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: implement support for user-space disconnect

Handle explicitly AF_UNSPEC in mptcp_stream_connnect() to
allow user-space to disconnect established MPTCP connections

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: cleanup accept and poll

After the previous patch, msk->subflow will never be deleted during
the whole msk lifetime. We don't need anymore to acquire references to
it in mptcp_stream_accept() and we can use the listener subflow accept
queue to simplify mptcp_poll() for listener socket.

Overall this removes a lock pair and 4 more atomic operations per
accept().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: full disconnect implementation

The current mptcp_disconnect() implementation lacks several
steps, we additionally need to reset the msk socket state
and flush the subflow list.

Factor out the needed helper to avoid code duplication.

Additionally ensure that the initial subflow is disposed
only after mptcp_close(), just reset it at disconnect time.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: implement fastclose xmit path

Allow the MPTCP xmit path to add MP_FASTCLOSE suboption
on RST egress packets.

Additionally reorder related options writing to reduce
the number of conditionals required in the fast path.

Co-developed-by: Geliang Tang <geliang.tang@suse.com>
Signed-off-by: Geliang Tang <geliang.tang@suse.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

mptcp: keep snd_una updated for fallback socket

After shutdown, for fallback MPTCP sockets, we always have

write_seq == snd_una+1

The above will foul OUTQ ioctl(). Keep snd_una in sync with
write_seq even after shutdown.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

Merge tag 'mlx5-updates-2022-01-06' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2022-01-06

1) Expose FEC per lane block counters via ethtool

2) Trivial fixes/updates/cleanup to mlx5e netdev driver

3) Fix htmldoc build warning

4) Spread mlx5 SFs (sub-functions) to all available CPU cores: Commits 1..5

Shay Drory Says:
================
Before this patchset, mlx5 subfunction shared the same IRQs (MSI-X) with
their peers subfunctions, causing them to use same CPU cores.

In large scale, this is very undesirable, SFs use small number of cpu
cores and all of them will be packed on the same CPU cores, not
utilizing all CPU cores in the system.

In this patchset we want to achieve two things.
a) Spread IRQs used by SFs to all cpu cores
b) Pack less SFs in the same IRQ, will result in multiple IRQs per core.

In this patchset, we spread SFs over all online cpus available to mlx5
irqs in Round-Robin manner. e.g.: Whenever a SF is created, pick the next
CPU core with least number of SF IRQs bound to it, SFs will share IRQs on
the same core until a certain limit, when such limit is reached, we
request a new IRQ and add it to that CPU core IRQ pool, when out of IRQs,
pick any IRQ with least number of SF users.

This enhancement is done in order to achieve a better distribution of
the SFs over all the available CPUs, which reduces application latency,
as shown bellow.

Machine details:
Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz with 56 cores.
PCI Express 3 with BW of 126 Gb/s.
ConnectX-5 Ex; EDR IB (100Gb/s) and 100GbE; dual-port QSFP28; PCIe4.0
x16.

Base line test description:
Single SF on the system. One instance of netperf is running on-top the
SF.
Numbers: latency = 15.136 usec, CPU Util = 35%

Test description:
There are 250 SFs on the system. There are 3 instances of netperf
running, on-top three different SFs, in parallel.

Perf numbers:
# netperf     SFs         latency(usec)     latency    CPU utilization
   affinity    affinity    (lower is better) increase %
1 cpu=0       cpu={0}     ~23 (app 1-3)     35%        75%
2 cpu=0,2,4   cpu={0}     app 1: 21.625     30%        68% (CPU 0)
                           app 2-3: 16.5     9%         15% (CPU 2,4)
3 cpu=0       cpu={0,2,4} app 1: ~16        7%         84% (CPU 0)
                           app 2-3: ~17.9    14%        22% (CPU 2,4)
4 cpu=0,2,4   cpu={0,2,4} 15.2 (app 1-3)    0%         33% (CPU 0,2,4)

- The first two entries (#1 and #2) show current state. e.g.: SFs are
   using the same CPU. The last two entries (#3 and #4) shows the latency
   reduction improvement of this patch. e.g.: SFs are on different CPUs.
- Whenever we use several CPUs, in case there is a different CPU
   utilization, write the utilization of each CPU separately.
- Whenever the latency result of the netperf instances were different,
   write the latency of each netperf instances separately.

Commands:
- for netperf CPU=0:
$ for i in {1..3}; do taskset -c 0 netperf -H 1${i}.1.1.1 -t TCP_RR  -- \
  -o RT_LATENCY -r8 & done

- for netperf CPU=0,2,4
$ for i in {1..3}; do taskset -c $(( ($i - 1) * 2  )) netperf -H \
  1${i}.1.1.1 -t TCP_RR  -- -o RT_LATENCY -r8 & done

================

====================

Signed-off-by: David S. Miller <davem@davemloft.net>

Bluetooth: hci_sock: fix endian bug in hci_sock_setsockopt()

This copies a u16 into the high bits of an int, which works on a big
endian system but not on a little endian system.

Fixes: 0d7587bfb3c0 ("Bluetooth: hci_sock: Add support for BT_{SND,RCV}BUF")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: L2CAP: uninitialized variables in l2cap_sock_setsockopt()

The "opt" variable is a u32, but on some paths only the top bytes
were initialized and the others contained random stack data.

Fixes: cf0058199c0b ("net: pass a sockptr_t into ->setsockopt")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btqca: sequential validation

Added Sequential validation support & patch command config

Signed-off-by: Sai Teja Aluvala <quic_saluvala@quicinc.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btusb: Add support for Foxconn QCA 0xe0d0

Add an ID of Qualcomm Bluetooth SoC WCN6855.

T:  Bus=05 Lev=01 Prnt=01 Port=03 Cnt=02 Dev#=  4 Spd=12   MxCh= 0
D:  Ver= 1.10 Cls=e0(wlcon) Sub=01 Prot=01 MxPS=64 #Cfgs=  1
P:  Vendor=0489 ProdID=e0d0 Rev= 0.01
C:* #Ifs= 2 Cfg#= 1 Atr=e0 MxPwr=100mA
I:* If#= 0 Alt= 0 #EPs= 3 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=81(I) Atr=03(Int.) MxPS=  16 Ivl=1ms
E:  Ad=82(I) Atr=02(Bulk) MxPS=  64 Ivl=0ms
E:  Ad=02(O) Atr=02(Bulk) MxPS=  64 Ivl=0ms
I:  If#= 1 Alt= 0 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=   0 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=   0 Ivl=1ms
I:  If#= 1 Alt= 1 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=   9 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=   9 Ivl=1ms
I:* If#= 1 Alt= 2 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=  17 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=  17 Ivl=1ms
I:  If#= 1 Alt= 3 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=  25 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=  25 Ivl=1ms
I:  If#= 1 Alt= 4 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=  33 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=  33 Ivl=1ms
I:  If#= 1 Alt= 5 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=  49 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=  49 Ivl=1ms
I:  If#= 1 Alt= 6 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=  63 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=  63 Ivl=1ms
I:  If#= 1 Alt= 7 #EPs= 2 Cls=e0(wlcon) Sub=01 Prot=01 Driver=btusb
E:  Ad=83(I) Atr=01(Isoc) MxPS=  65 Ivl=1ms
E:  Ad=03(O) Atr=01(Isoc) MxPS=  65 Ivl=1ms

Signed-off-by: Aaron Ma <aaron.ma@canonical.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Bluetooth: btintel: Fix broken LED quirk for legacy ROM devices

This patch fixes the broken LED quirk for Intel legacy ROM devices.
To fix the LED issue that doesn't turn off immediately, the host sends
the SW RFKILL command while shutting down the interface and it puts the
devices in SW RFKILL state.

Once the device is in SW RFKILL state, it can only accept HCI_Reset to
exit from the SW RFKILL state. This patch checks the quirk for broken
LED and sends the HCI_Reset before sending the HCI_Intel_Read_Version
command.

The affected legacy ROM devices are
- 8087:07dc
- 8087:0a2a
- 8087:0aa7

Fixes: acac36d3c055b ("Bluetooth: btintel: Fix the LED is not turning off immediately")
Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>

Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue

Tony Nguyen says:

====================
100GbE Intel Wired LAN Driver Updates 2022-01-06

Victor adds restoring of advanced rules after reset.

Wojciech improves usage of switchdev control VSI by utilizing the
device's advanced rules for forwarding.

Christophe Jaillet removes some unneeded calls to zero bitmaps, changes
some bitmap operations that don't need to be atomic, and converts a
kfree() to a more appropriate bitmap_free().

* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
  ice: Use bitmap_free() to free bitmap
  ice: Optimize a few bitmap operations
  ice: Slightly simply ice_find_free_recp_res_idx
  ice: improve switchdev's slow-path
  ice: replay advanced rules after reset
====================

Link: https://lore.kernel.org/r/20220106183013.3777622-1-anthony.l.nguyen@intel.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Merge branch 'mlxsw-add-spectrum-4-support'

Ido Schimmel says:

====================
mlxsw: Add Spectrum-4 support

This patchset adds Spectrum-4 support in mlxsw. It builds on top of a
previous patchset merged in commit 7f7afa9679d4 ("Merge branch
'mlxsw-Spectrum-4-prep'") and makes two additional changes before adding
Spectrum-4 support.

Patchset overview:

Patches #1-#2 add a few Spectrum-4 specific variants of existing ACL
keys. The new variants are needed because the size of certain key
elements (e.g., local port) was increased in Spectrum-4.

Patches #3-#6 are preparations.

Patch #7 implements the Spectrum-4 variant of the Bloom filter hash
function. The Bloom filter is used to optimize ACL lookups by
potentially skipping certain lookups if they are guaranteed not to
match. See additional info in merge commit 71d1ced5466d ("Merge branch
'mlxsw-spectrum_acl-Add-Bloom-filter-support'").

Patch #8 finally adds Spectrum-4 support.
====================

Link: https://lore.kernel.org/r/20220106160652.821176-1-idosch@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum: Extend to support Spectrum-4 ASIC

Extend existing driver for Spectrum, Spectrum-2 and Spectrum-3 ASICs
to support Spectrum-4 ASIC as well.

Currently there is no released firmware version for Spectrum-4, so the
driver is not enforcing a minimum version.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_acl_bloom_filter: Add support for Spectrum-4 calculation

Spectrum-4 will calculate hash function for bloom filter differently
from the existing ASICs.

First, two hash functions will be used to calculate 16 bits result.
The final result will be combination of the two results - 6 bits which
are result of CRC-6 will be used as MSB and 10 bits which are result of
CRC-10 will be used as LSB.

Second, while in Spectrum{2,3}, there is a padding in each chunk, so the
chunks use a sequence of whole bytes, in Spectrum-4 there is no padding,
so each chunk use 20 bytes minus 2 bits, so it is necessary to align the
chunks to be without holes.

Add dedicated 'mlxsw_sp_acl_bf_ops' for Spectrum-4 and add the required
tables for CRC calculations.

All the details are documented as part of the code for future use.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: Add operations structure for bloom filter calculation

Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.

There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.

As preparation for support of Spectrum-4 bloom filter, add 'ops'
structure to allow handling different calculation for different ASICs.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_acl_bloom_filter: Rename Spectrum-2 specific objects for future use

Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.

There are two changes:
1. Instead of using one hash function to calculate 16 bits output (CRC-16),
two functions will be used.
2. The chunks will be built differently, without padding.

As preparation for support of Spectrum-4 bloom filter, rename CRC table
to include "sp2" prefix and "crc16", as next patch will add two additional
tables. In addition, rename all the dedicated functions and defines for
Spectrum-{2,3} to include "sp2" prefix.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

mlxsw: spectrum_acl_bloom_filter: Make mlxsw_sp_acl_bf_key_encode() more flexible

Spectrum-4 will calculate hash function for bloom filter differently from
the existing ASICs.

One of the changes is related to the way that the chunks will be build -
without padding.

As preparation for support of Spectrum-4 bloom filter, make
mlxsw_sp_acl_bf_key_encode() more flexible, so it will be able to use it
for Spectrum-4 as well.

Signed-off-by: Amit Cohen <amcohen@nvidia.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>