Jakub Kicinski [Fri, 29 Jul 2022 05:24:41 +0000 (22:24 -0700)]
Merge branch 'net-dsa-qca8k-code-split-for-qca8k'
Christian Marangi says:
====================
net: dsa: qca8k: code split for qca8k
This is needed ad ipq4019 SoC have an internal switch that is
based on qca8k with very minor changes. The general function is equal.
Because of this we split the driver to common and specific code.
As the common function needs to be moved to a different file to be
reused, we had to convert every remaining user of qca8k_read/write/rmw
to regmap variant.
We had also to generilized the special handling for the ethtool_stats
function that makes use of the autocast mib. (ipq4019 will have a
different tagger and use mmio so it could be quicker to use mmio instead
of automib feature)
And we had to convert the regmap read/write to bulk implementation to
drop the special function that makes use of it. This will be compatible
with ipq4019 and at the same time permits normal switch to use the eth
mgmt way to send the entire ATU table read/write in one go.
====================
net: dsa: qca8k: move read_switch_id function to common code
The same function to read the switch id is used by drivers based on
qca8k family switch. Move them to common code to make them accessible
also by other drivers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move port LAG functions to common code
The same port LAG functions are used by drivers based on qca8k family
switch. Move them to common code to make them accessible also by other
drivers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move port VLAN functions to common code
The same port VLAN functions are used by drivers based on qca8k family
switch. Move them to common code to make them accessible also by other
drivers.
Also drop exposing busy_wait and make it static.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move port mirror functions to common code
The same port mirror functions are used by drivers based on qca8k family
switch. Move them to common code to make them accessible also by other
drivers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move port FDB/MDB function to common code
The same port FDB/MDB function are used by drivers based on qca8k family
switch. Move them to common code to make them accessible also by other
drivers.
Also drop bulk read/write functions and make them static
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move set age/MTU/port enable/disable functions to common code
The same set age, MTU and port enable/disable function are used by
driver based on qca8k family switch.
Move them to common code to make them accessible also by other drivers.
While at it also drop unnecessary qca8k_priv cast for void pointers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move bridge functions to common code
The same bridge functions are used by drivers based on qca8k family
switch. Move them to common code to make them accessible also by other
drivers.
While at it also drop unnecessary qca8k_priv cast for void pointers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move port set status/eee/ethtool stats function to common code
The same logic to disable/enable port, set eee and get ethtool stats is
used by drivers based on qca8k family switch.
Move it to common code to make it accessible also by other drivers.
While at it also drop unnecessary qca8k_priv cast for void pointers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move mib init function to common code
The same mib function is used by drivers based on qca8k family switch.
Move it to common code to make it accessible also by other drivers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move qca8k bulk read/write helper to common code
The same ATU function are used by drivers based on qca8k family switch.
Move the bulk read/write helper to common code to declare these shared
ATU functions in common code.
These helper will be dropped when regmap correctly support bulk
read/write.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: move qca8k read/write/rmw and reg table to common code
The same reg table and read/write/rmw function are used by drivers
based on qca8k family switch.
Move them to common code to make it accessible also by other drivers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The same MIB struct is used by drivers based on qca8k family switch. Move
it to common code to make it accessible also by other drivers.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: make mib autocast feature optional
Some switch may not support mib autocast feature and require the legacy
way of reading the regs directly.
Make the mib autocast feature optional and permit to declare support for
it using match_data struct in a dedicated qca8k_info_ops struct.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: dsa: qca8k: cache match data to speed up access
Using of_device_get_match_data is expensive. Cache match data to speed
up access and rework user of match data to use the new cached value.
Signed-off-by: Christian Marangi <ansuelsmth@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selftests: net: dsa: Add a Makefile which installs the selftests
Add a Makefile which takes care of installing the selftests in
tools/testing/selftests/drivers/net/dsa. This can be used to install all
DSA specific selftests and forwarding.config using the same approach as
for the selftests in tools/testing/selftests/net/forwarding.
====================
Take devlink lock on mlx4 and mlx5 callbacks
Prepare mlx4 and mlx5 drivers to have all devlink callbacks called with
devlink instance locked. Change mlx4 driver to use devl_ API where
needed to have devlink reload callbacks locked. Change mlx5 driver to
use devl_ API where needed to have devlink reload and devlink health
callbacks locked.
As mlx5 is the only driver which needed changes to enable calling health
callbacks with devlink instance locked, this patchset also removes
DEVLINK_NL_FLAG_NO_LOCK flag from devlink health callbacks.
This patchset will be followed by a patchset that will remove
DEVLINK_NL_FLAG_NO_LOCK flag from devlink and will remove devlink_mutex.
====================
net/mlx5: Lock mlx5 devlink health recovery callback
Change devlink instance locks in mlx5 driver to have devlink health
recovery callback locked, while keeping all driver paths which lead to
devl_ API functions called by the driver locked.
Change devlink instance locks in mlx4 driver to have devlink reload
callback locked, while keeping all driver paths which leads to devl_ API
functions called by the mlx4 driver locked.
net/mlx4: Use devl_ API for devlink port register / unregister
Use devl_ API to call devl_port_register() and devl_port_unregister()
instead of devlink_port_register() and devlink_port_unregister(). Add
devlink instance lock in mlx4 driver paths to these functions.
This will be used by the downstream patch to invoke mlx4 devlink reload
callbacks with devlink lock held.
net/mlx4: Use devl_ API for devlink region create / destroy
Use devl_ API to call devl_region_create() and devl_region_destroy()
instead of devlink_region_create() and devlink_region_destroy().
Add devlink instance lock in mlx4 driver paths to these functions.
This will be used by the downstream patch to invoke mlx4 devlink reload
callbacks with devlink lock held.
Change devlink instance locks in mlx5 driver to have devlink reload
callbacks locked, while keeping all driver paths which lead to devl_ API
functions called by the driver locked.
Add mlx5_load_one_devl_locked() and mlx5_unload_one_devl_locked() which
are used by the paths which are already locked such as devlink reload
callbacks.
This patch makes the driver use devl_ API also for traps register as
these functions are called from the driver paths parallel to reload that
requires locking now.
net/mlx5: Move fw reset unload to mlx5_fw_reset_complete_reload
Refactor fw reset code to have the unload driver part done on
mlx5_fw_reset_complete_reload(), so if it was called by the PF which
initiated the reload fw activate flow, the unload part will be handled
by the mlx5_devlink_reload_fw_activate() callback itself and not by the
reset event work.
This will be used by the downstream patch to invoke devlink reload
callbacks with devlink lock held.
net: devlink: remove region snapshots list dependency on devlink->lock
After mlx4 driver is converted to do locked reload,
devlink_region_snapshot_create() may be called from both locked and
unlocked context.
Note that in mlx4 region snapshots could be created on any command
failure. That can happen in any flow that involves commands to FW,
which means most of the driver flows.
So resolve this by removing dependency on devlink->lock for region
snapshots list consistency and introduce new mutex to ensure it.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: devlink: remove region snapshot ID tracking dependency on devlink->lock
After mlx4 driver is converted to do locked reload, functions to get/put
regions snapshot ID may be called from both locked and unlocked context.
So resolve this by removing dependency on devlink->lock for region
snapshot ID tracking by using internal xa_lock() to maintain
shapshot_ids xa_array consistency.
Signed-off-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
add framework for selftests in devlink
Add support for selftests in the devlink framework.
Adds a callback .selftests_check and .selftests_run in devlink_ops.
User can add test(s) suite which is subsequently passed to the driver
and driver can opt for running particular tests based on its capabilities.
Patchset adds a flash based test for the bnxt_en driver.
====================
Add a framework for running selftests.
Framework exposes devlink commands and test suite(s) to the user
to execute and query the supported tests by the driver.
Below are new entries in devlink_nl_ops
devlink_nl_cmd_selftests_show_doit/dumpit: To query the supported
selftests by the drivers.
devlink_nl_cmd_selftests_run: To execute selftests. Users can
provide a test mask for executing group tests or standalone tests.
Documentation/networking/devlink/ path is already part of MAINTAINERS &
the new files come under this path. Hence no update needed to the
MAINTAINERS
Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com> Reviewed-by: Andy Gospodarek <gospo@broadcom.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
mlx5e use TLS TX pool to improve connection rate
To offload encryption operations, the mlx5 device maintains state and
keeps track of every kTLS device-offloaded connection. Two HW objects
are used per TX context of a kTLS offloaded connection: a. Transport
interface send (TIS) object, to reach the HW context. b. Data Encryption
Key (DEK) to perform the crypto operations.
These two objects are created and destroyed per TLS TX context, via FW
commands. In total, 4 FW commands are issued per TLS TX context, which
seriously limits the connection rate.
In this series, we aim to save creation and destroy of TIS objects by
recycling them. Upon recycling of a TIS, the HW still needs to be
notified for the re-mapping between a TIS and a context. This is done by
posting WQEs via an SQ, significantly faster API than the FW command
interface.
A pool is used for recycling. The pool dynamically interacts to the load
and connection rate, growing and shrinking accordingly.
Saving the TIS FW commands per context increases connection rate by ~42%,
from 11.6K to 16.5K connections per sec.
Connection rate is still limited by FW bottleneck due to the remaining
per context FW commands (DEK create/destroy). This will soon be addressed
in a followup series. By combining the two series, the FW bottleneck
will be released, and a significantly higher (about 100K connections per
sec) kTLS TX device-offloaded connection rate is reached.
====================
net/mlx5e: kTLS, Dynamically re-size TX recycling pool
Let the TLS TX recycle pool be more flexible in size, by continuously
and dynamically allocating and releasing HW resources in response to
changes in the connections rate and load.
Allocate and release pool entries in bulks (16). Use a workqueue to
release/allocate in the background. Allocate a new bulk when the pool
size goes lower than the low threshold (1K). Symmetric operation is done
when the pool size gets greater than the upper threshold (4K).
Every idle pool entry holds: 1 TIS, 1 DEK (HW resources), in addition to
~100 bytes in host memory.
Start with an empty pool to minimize memory and HW resources waste for
non-TLS users that have the device-offload TLS enabled.
Upon a new request, in case the pool is empty, do not wait for a whole bulk
allocation to complete. Instead, trigger an instant allocation of a single
resource to reduce latency.
net/mlx5e: kTLS, Recycle objects of device-offloaded TLS TX connections
The transport interface send (TIS) object is responsible for performing
all transport related operations of the transmit side. The ConnectX HW
uses a TIS object to save and access the TLS crypto information and state
of an offloaded TX kTLS connection.
Before this patch, we used to create a new TIS per connection and destroy
it once it’s closed. Every create and destroy of a TIS is a FW command.
Same applies for the private TLS context, where we used to dynamically
allocate and free it per connection.
Resources recycling reduce the impact of the allocation/free operations
and helps speeding up the connection rate.
In this feature we maintain a pool of TX objects and use it to recycle
the resources instead of re-creating them per connection.
A cached TIS popped from the pool is updated to serve the new connection
via the fast-path HW interface, updating the tls static and progress
params. This is a very fast operation, significantly faster than FW
commands.
On recycling, a WQE fence is required after the context params change.
This guarantees that the data is sent after the context has been
successfully updated in hardware, and that the context modification
doesn't interfere with existing traffic.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Let the caller of mlx5e_ktls_tx_handle_ooo() take care of updating the
stats, according to the returned value. As the switch/case blocks are
already there, this change saves unnecessary branches in the handler.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
TLS TIS objects have a defined role in mapping and reaching the HW TLS
contexts. Some standard TIS attributes (like LAG port affinity) are
not relevant for them.
Use a dedicated TLS TIS create function instead of the generic
mlx5e_create_tis.
Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Gal Pressman <gal@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Multiple TLS device-offloaded contexts can be added in parallel via
concurrent calls to .tls_dev_add, while calls to .tls_dev_del are
sequential in tls_device_gc_task.
This is not a sustainable behavior. This creates a rate gap between add
and del operations (addition rate outperforms the deletion rate). When
running for enough time, the TLS device resources could get exhausted,
failing to offload new connections.
Replace the single-threaded garbage collector work with a per-context
alternative, so they can be handled on several cores in parallel. Use
a new dedicated destruct workqueue for this.
net/tls: Perform immediate device ctx cleanup when possible
TLS context destructor can be run in atomic context. Cleanup operations
for device-offloaded contexts could require access and interaction with
the device callbacks, which might sleep. Hence, the cleanup of such
contexts must be deferred and completed inside an async work.
For all others, this is not necessary, as cleanup is atomic. Invoke
cleanup immediately for them, avoiding queueing redundant gc work.
Yang Li [Thu, 28 Jul 2022 03:10:19 +0000 (11:10 +0800)]
tls: rx: Fix unsigned comparison with less than zero
The return from the call to tls_rx_msg_size() is int, it can be
a negative error code, however this is being assigned to an
unsigned long variable 'sz', so making 'sz' an int.
Eliminate the following coccicheck warning:
./net/tls/tls_strp.c:211:6-8: WARNING: Unsigned expression compared with zero: sz < 0
Jakub Kicinski [Fri, 29 Jul 2022 04:50:02 +0000 (21:50 -0700)]
Merge branch 'tls-rx-follow-ups-to-rx-work'
Jakub Kicinski says:
====================
tls: rx: follow ups to rx work
A selection of unrelated changes. First some selftest polishing.
Next a change to rcvtimeo handling for locking based on an exchange
with Eric. Follow up to Paolo's comments from yesterday. Last but
not least a fix to a false positive warning, turns out I've been
testing with DEBUG_NET=n this whole time.
====================
Jakub Kicinski [Wed, 27 Jul 2022 03:15:24 +0000 (20:15 -0700)]
tls: rx: fix the false positive warning
I went too far in the accessor conversion, we can't use tls_strp_msg()
after decryption because the message may not be ready. What we care
about on this path is that the output skb is detached, i.e. we didn't
somehow just turn around and used the input skb with its TCP data
still attached. So look at the anchor directly.
Fixes: 37c2526e223d ("tls: rx: do not use the standard strparser") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Wed, 27 Jul 2022 03:15:23 +0000 (20:15 -0700)]
tls: strp: rename and multithread the workqueue
Paolo points out that there seems to be no strong reason strparser
users a single threaded workqueue. Perhaps there were some performance
or pinning considerations? Since we don't know (and it's the slow path)
let's default to the most natural, multi-threaded choice.
Also rename the workqueue to "tls-".
Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Eric indicates that restarting rcvtimeo on every wait may be fine.
I thought that we should consider it cumulative, and made
tls_rx_reader_lock() return the remaining timeo after acquiring
the reader lock.
tls_rx_rec_wait() gets its timeout passed in by value so it
does not keep track of time previously spent.
Make the lock waiting consistent with tls_rx_rec_wait() - don't
keep track of time spent.
Read the timeo fresh in tls_rx_rec_wait().
It's unclear to me why callers are supposed to cache the value.
Jakub Kicinski [Wed, 27 Jul 2022 03:15:21 +0000 (20:15 -0700)]
selftests: tls: handful of memrnd() and length checks
Add a handful of memory randomizations and precise length checks.
Nothing is really broken here, I did this to increase confidence
when debugging. It does fix a GCC warning, tho. Apparently GCC
recognizes that memory needs to be initialized for send() but
does not recognize that for write().
* tag 'net-5.19-final' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (67 commits)
stmmac: dwmac-mediatek: fix resource leak in probe
ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
net: ping6: Fix memleak in ipv6_renew_options().
net/funeth: Fix fun_xdp_tx() and XDP packet reclaim
sctp: leave the err path free in sctp_stream_init to sctp_stream_free
sfc: disable softirqs for ptp TX
ptp: ocp: Select CRC16 in the Kconfig.
tcp: md5: fix IPv4-mapped support
virtio-net: fix the race between refill work and close
mptcp: Do not return EINPROGRESS when subflow creation succeeds
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
Bluetooth: Always set event mask on suspend
Bluetooth: mgmt: Fix double free on error path
wifi: mac80211: do not abuse fq.lock in ieee80211_do_stop()
ice: do not setup vlan for loopback VSI
ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
ice: Fix VSIs unable to share unicast MAC
ice: Fix tunnel checksum offload with fragmented traffic
ice: Fix max VLANs available for VF
netfilter: nft_queue: only allow supported familes and hooks
...
Ziyang Xuan [Thu, 28 Jul 2022 01:33:07 +0000 (09:33 +0800)]
ipv6/addrconf: fix a null-ptr-deref bug for ip6_ptr
Change net device's MTU to smaller than IPV6_MIN_MTU or unregister
device while matching route. That may trigger null-ptr-deref bug
for ip6_ptr probability as following.
=========================================================
BUG: KASAN: null-ptr-deref in find_match.part.0+0x70/0x134
Read of size 4 at addr 0000000000000308 by task ping6/263
Reproducer as following:
Firstly, prepare conditions:
$ip netns add ns1
$ip netns add ns2
$ip link add veth1 type veth peer name veth2
$ip link set veth1 netns ns1
$ip link set veth2 netns ns2
$ip netns exec ns1 ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1
$ip netns exec ns2 ip -6 addr add 2001:0db8:0:f101::2/64 dev veth2
$ip netns exec ns1 ifconfig veth1 up
$ip netns exec ns2 ifconfig veth2 up
$ip netns exec ns1 ip -6 route add 2000::/64 dev veth1 metric 1
$ip netns exec ns2 ip -6 route add 2001::/64 dev veth2 metric 1
Secondly, execute the following two commands in two ssh windows
respectively:
$ip netns exec ns1 sh
$while true; do ip -6 addr add 2001:0db8:0:f101::1/64 dev veth1; ip -6 route add 2000::/64 dev veth1 metric 1; ping6 2000::2; done
$ip netns exec ns1 sh
$while true; do ip link set veth1 mtu 1000; ip link set veth1 mtu 1500; sleep 5; done
It is because ip6_ptr has been assigned to NULL in addrconf_ifdown() firstly,
then ip6_ignore_linkdown() accesses ip6_ptr directly without NULL check.
When we close ping6 sockets, some resources are left unfreed because
pingv6_prot is missing sk->sk_prot->destroy(). As reported by
syzbot [0], just three syscalls leak 96 bytes and easily cause OOM.
struct ipv6_sr_hdr *hdr;
char data[24] = {0};
int fd;
To fix memory leaks, let's add a destroy function.
Note the socket() syscall checks if the GID is within the range of
net.ipv4.ping_group_range. The default value is [1, 0] so that no
GID meets the condition (1 <= GID <= 0). Thus, the local DoS does
not succeed until we change the default value. However, at least
Ubuntu/Fedora/RHEL loosen it.
watch_queue: Fix missing locking in add_watch_to_object()
If a watch is being added to a queue, it needs to guard against
interference from addition of a new watch, manual removal of a watch and
removal of a watch due to some other queue being destroyed.
KEYCTL_WATCH_KEY guards against this for the same {key,queue} pair by
holding the key->sem writelocked and by holding refs on both the key and
the queue - but that doesn't prevent interaction from other {key,queue}
pairs.
While add_watch_to_object() does take the spinlock on the event queue,
it doesn't take the lock on the source's watch list. The assumption was
that the caller would prevent that (say by taking key->sem) - but that
doesn't prevent interference from the destruction of another queue.
Fix this by locking the watcher list in add_watch_to_object().
Fixes: a23d9805b5b3 ("pipe: Add general notification queue support") Reported-by: syzbot+03d7b43290037d1f87ca@syzkaller.appspotmail.com Signed-off-by: David Howells <dhowells@redhat.com>
cc: keyrings@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Thu, 28 Jul 2022 09:31:06 +0000 (10:31 +0100)]
watch_queue: Fix missing rcu annotation
Since __post_watch_notification() walks wlist->watchers with only the
RCU read lock held, we need to use RCU methods to add to the list (we
already use RCU methods to remove from the list).
Fix add_watch_to_object() to use hlist_add_head_rcu() instead of
hlist_add_head() for that list.
Fixes: a23d9805b5b3 ("pipe: Add general notification queue support") Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
net: cdns,macb: use correct xlnx prefix for Xilinx
Use correct vendor for Xilinx versions of Cadence MACB/GEM Ethernet
controller. The Versal compatible was not released, so it can be
changed. Zynq-7xxx and Ultrascale+ has to be kept in new and deprecated
form.
dt-bindings: net: cdns,macb: use correct xlnx prefix for Xilinx
Use correct vendor for Xilinx versions of Cadence MACB/GEM Ethernet
controller. The Versal compatible was not released, so it can be
changed. Zynq-7xxx and Ultrascale+ has to be kept in new and deprecated
form.
net/funeth: Fix fun_xdp_tx() and XDP packet reclaim
The current implementation of fun_xdp_tx(), used for XPD_TX, is
incorrect in that it takes an address/length pair and later releases it
with page_frag_free(). It is OK for XDP_TX but the same code is used by
ndo_xdp_xmit. In that case it loses the XDP memory type and releases the
packet incorrectly for some of the types. Assorted breakage follows.
Change fun_xdp_tx() to take xdp_frame and rely on xdp_return_frame() in
reclaim.
Paolo Abeni [Thu, 28 Jul 2022 09:54:56 +0000 (11:54 +0200)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue
Tony Nguyen says:
====================
ice: PPPoE offload support
Marcin Szycik says:
Add support for dissecting PPPoE and PPP-specific fields in flow dissector:
PPPoE session id and PPP protocol type. Add support for those fields in
tc-flower and support offloading PPPoE. Finally, add support for hardware
offload of PPPoE packets in switchdev mode in ice driver.
Example filter:
tc filter add dev $PF1 ingress protocol ppp_ses prio 1 flower pppoe_sid \
1234 ppp_proto ip skip_sw action mirred egress redirect dev $VF1_PR
Changes in iproute2 are required to use the new fields (will be submitted
soon).
ICE COMMS DDP package is required to create a filter in ice.
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue:
ice: Add support for PPPoE hardware offload
flow_offload: Introduce flow_match_pppoe
net/sched: flower: Add PPPoE filter
flow_dissector: Add PPPoE dissectors
====================
Jakub Kicinski [Tue, 26 Jul 2022 21:56:52 +0000 (14:56 -0700)]
add missing includes and forward declarations to networking includes under linux/
Similarly to a recent include/net/ cleanup, this patch adds
missing includes to networking headers under include/linux.
All these problems are currently masked by the existing users
including the missing dependency before the broken header.
Marcin Wojtas [Tue, 26 Jul 2022 23:09:18 +0000 (01:09 +0200)]
net: dsa: mv88e6xxx: fix speed setting for CPU/DSA ports
Commit 210e28a51213 ("net: dsa: mv88e6xxx: get rid of SPEED_MAX setting")
stopped relying on SPEED_MAX constant and hardcoded speed settings
for the switch ports and rely on phylink configuration.
It turned out, however, that when the relevant code is called,
the mac_capabilites of CPU/DSA port remain unset.
mv88e6xxx_setup_port() is called via mv88e6xxx_setup() in
dsa_tree_setup_switches(), which precedes setting the caps in
phylink_get_caps down in the chain of dsa_tree_setup_ports().
As a result the mac_capabilites are 0 and the default speed for CPU/DSA
port is 10M at the start. To fix that, execute mv88e6xxx_get_caps()
and obtain the capabilities driectly.
Fixes: 210e28a51213 ("net: dsa: mv88e6xxx: get rid of SPEED_MAX setting") Signed-off-by: Marcin Wojtas <mw@semihalf.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20220726230918.2772378-1-mw@semihalf.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Thu, 28 Jul 2022 02:56:28 +0000 (19:56 -0700)]
Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2022-07-26
This series contains updates to ice driver only.
Przemyslaw corrects accounting for VF VLANs to allow for correct number
of VLANs for untrusted VF. He also correct issue with checksum offload
on VXLAN tunnels.
Ani allows for two VSIs to share the same MAC address.
Maciej corrects checked bits for descriptor completion of loopback
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
ice: do not setup vlan for loopback VSI
ice: check (DD | EOF) bits on Rx descriptor rather than (EOP | RS)
ice: Fix VSIs unable to share unicast MAC
ice: Fix tunnel checksum offload with fragmented traffic
ice: Fix max VLANs available for VF
====================
This happens when calling sctp_sendmsg without connecting to server first.
In this case, a data chunk already queues up in send queue of client side
when processing the INIT_ACK from server in sctp_process_init() where it
calls sctp_stream_init() to alloc stream_in. If it fails to alloc stream_in
all stream_out will be freed in sctp_stream_init's err path. Then in the
asoc freeing it will crash when dequeuing this data chunk as stream_out
is missing.
As we can't free stream out before dequeuing all data from send queue, and
this patch is to fix it by moving the err path stream_out/in freeing in
sctp_stream_init() to sctp_stream_free() which is eventually called when
freeing the asoc in sctp_association_free(). This fix also makes the code
in sctp_process_init() more clear.
Note that in sctp_association_init() when it fails in sctp_stream_init(),
sctp_association_free() will not be called, and in that case it should
go to 'stream_free' err path to free stream instead of 'fail_init'.
Sending a PTP packet can imply to use the normal TX driver datapath but
invoked from the driver's ptp worker. The kernel generic TX code
disables softirqs and preemption before calling specific driver TX code,
but the ptp worker does not. Although current ptp driver functionality
does not require it, there are several reasons for doing so:
1) The invoked code is always executed with softirqs disabled for non
PTP packets.
2) Better if a ptp packet transmission is not interrupted by softirq
handling which could lead to high latencies.
3) netdev_xmit_more used by the TX code requires preemption to be
disabled.
Indeed a solution for dealing with kernel preemption state based on static
kernel configuration is not possible since the introduction of dynamic
preemption level configuration at boot time using the static calls
functionality.
Eric Dumazet [Tue, 26 Jul 2022 11:57:43 +0000 (11:57 +0000)]
tcp: md5: fix IPv4-mapped support
After the blamed commit, IPv4 SYN packets handled
by a dual stack IPv6 socket are dropped, even if
perfectly valid.
$ nstat | grep MD5
TcpExtTCPMD5Failure 5 0.0
For a dual stack listener, an incoming IPv4 SYN packet
would call tcp_inbound_md5_hash() with @family == AF_INET,
while tp->af_specific is pointing to tcp_sock_ipv6_specific.
Only later when an IPv4-mapped child is created, tp->af_specific
is changed to tcp_sock_ipv6_mapped_specific.
Fixes: 1ded5af4601a ("net/tcp: Merge TCP-MD5 inbound callbacks") Reported-by: Brian Vazquez <brianvv@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Reviewed-by: Dmitry Safonov <dima@arista.com> Tested-by: Leonard Crestez <cdleonard@gmail.com> Link: https://lore.kernel.org/r/20220726115743.2759832-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Merge tag 'asm-generic-fixes-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull asm-generic fixes from Arnd Bergmann:
"Two more bug fixes for asm-generic, one addressing an incorrect
Kconfig symbol reference and another one fixing a build failure for
the perf tool on mips and possibly others"
* tag 'asm-generic-fixes-5.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
asm-generic: remove a broken and needless ifdef conditional
tools: Fixed MIPS builds due to struct flock re-definition
Stefan Raspl [Mon, 25 Jul 2022 14:10:00 +0000 (16:10 +0200)]
net/smc: Enable module load on netlink usage
Previously, the smc and smc_diag modules were automatically loaded as
dependencies of the ism module whenever an ISM device was present.
With the pending rework of the ISM API, the smc module will no longer
automatically be loaded in presence of an ISM device. Usage of an AF_SMC
socket will still trigger loading of the smc modules, but usage of a
netlink socket will not.
This is addressed by setting the correct module aliases.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 25 Jul 2022 14:09:59 +0000 (16:09 +0200)]
net/smc: Pass on DMBE bit mask in IRQ handler
Make the DMBE bits, which are passed on individually in ism_move() as
parameter idx, available to the receiver.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Stefan Raspl [Mon, 25 Jul 2022 14:09:58 +0000 (16:09 +0200)]
s390/ism: Cleanups
Reworked signature of the function to retrieve the system EID: No plausible
reason to use a double pointer. And neither to pass in the device as an
argument, as this identifier is by definition per system, not per device.
Plus some minor consistency edits.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This struct is used in a single place only, and its usage generates
inefficient code. Time to clean up!
Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-and-tested-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Wenjia Zhang < wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jason Wang [Mon, 25 Jul 2022 07:21:59 +0000 (15:21 +0800)]
virtio-net: fix the race between refill work and close
We try using cancel_delayed_work_sync() to prevent the work from
enabling NAPI. This is insufficient since we don't disable the source
of the refill work scheduling. This means an NAPI poll callback after
cancel_delayed_work_sync() can schedule the refill work then can
re-enable the NAPI that leads to use-after-free [1].
Since the work can enable NAPI, we can't simply disable NAPI before
calling cancel_delayed_work_sync(). So fix this by introducing a
dedicated boolean to control whether or not the work could be
scheduled from NAPI.
[1]
==================================================================
BUG: KASAN: use-after-free in refill_work+0x43/0xd4
Read of size 2 at addr ffff88810562c92e by task kworker/2:1/42
Fixes: 508106e7cdbb9 ("virtio_net: set/cancel work on ndo_open/ndo_stop") Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 27 Jul 2022 08:39:17 +0000 (09:39 +0100)]
Merge branch 'dsa-microchip-phylink-mac-config'
Arun Ramadoss says:
====================
net: dsa: microchip: add support for phylink mac config and link up
This patch series add support common phylink mac config and link up for the ksz
series switches. At present, ksz8795 and ksz9477 doesn't implement the phylink
mac config and link up. It configures the mac interface in the port setup hook.
ksz8830 series switch does not mac link configuration. For lan937x switches, in
the part support patch series has support only for MII and RMII configuration.
Some group of switches have some register address and bit fields common and
others are different. So, this patch aims to have common phylink implementation
which configures the register based on the chip id.
Changes in v2
- combined the modification of duplex, tx_pause and rx_pause into single
function.
Changes in v1
- Squash the reading rgmii value from dt to patch which apply the rgmii value
- Created the new function ksz_port_set_xmii_speed
- Seperated the namespace values for xmii_ctrl_0 and xmii_ctrl_1 register
- Applied the rgmii delay value based on the rx/tx-internal-delay-ps
====================
Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: add support for phylink mac config
This patch add support for phylink mac config for ksz series of
switches. All the files ksz8795, ksz9477 and lan937x uses the ksz common
xmii function. Instead of calling from the individual files, it is moved
to the ksz common phylink mac config function.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: ksz9477: use common xmii function
In ksz9477.c file, configuring the xmii register is performed based on
the flag NEW_XMII. The flag is reset for ksz9893 switch and set for
other switch. This patch uses the ksz common xmii set and get function.
The bit values are configured based on the chip id.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: apply rgmii tx and rx delay in phylink mac config
This patch read the rgmii tx and rx delay from device tree and stored it
in the ksz_port. It applies the rgmii delay to the xmii tune adjust
register based on the interface selected in phylink mac config. There
are two rgmii port in LAN937x and value to be loaded in the register
vary depends on the port selected.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: lan937x: add support for configuing xMII register
This patch add the common ksz_set_xmii function for ksz series switch
and update the lan937x code phylink mac config. The register address for
the ksz8795 is Port 5 Interface control 6 and for all other switch is
xMII Control 1.
The bit value for selecting the interface is same for
KSZ8795 and KSZ9893 are same. The bit values for KSZ9477 and lan973x are
same. So, this patch add the bit value for each switches in
ksz_chip_data and configure the registers based on the chip id.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: add support for common phylink mac link up
This patch add the support for common phylink mac link up for the ksz
series switch. The register address, bit position and values are
configured based on the chip id to the dev->info structure.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: add common duplex and flow control function
This patch add common function for configuring the Full/Half duplex and
transmit/receive flow control. KSZ8795 uses the Global control register
4 for configuring the duplex and flow control, whereas all other KSZ9477
based switch uses the xMII Control 0 register.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: add common ksz port xmii speed selection function
This patch adds the function for configuring the 100/10Mbps speed
selection for the ksz switches. KSZ8795 switch uses Global control 4
register 0x06 bit 4 for choosing 100/10Mpbs. Other switches uses xMII
control 1 0xN300 for it.
For KSZ8795, if the bit is set then 10Mbps is chosen and if bit is
clear then 100Mbps chosen. For all other switches it is other way
around, if the bit is set then 100Mbps is chosen.
So, this patch add the generic function for ksz switch to select the
100/10Mbps speed selection. While configuring, first it disables the
gigabit functionality and then configure the respective speed.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: microchip: add common gigabit set and get function
This patch add helper function for setting and getting the gigabit
enable for the ksz series switch. KSZ8795 switch has different register
address compared to all other ksz switches. KSZ8795 series uses the Port
5 Interface control 6 Bit 6 for configuring the 1Gbps or 100/10Mbps
speed selection. All other switches uses the xMII control 1 0xN301
register Bit6 for gigabit.
Further, for KSZ8795 & KSZ9893 switches if bit 1 then 1Gbps is chosen
and if bit 0 then 100/10Mbps is chosen. It is other way around for
other switches bit 0 is for 1Gbps. So, this patch implements the common
function for configuring the gigabit set and get capability.
Signed-off-by: Arun Ramadoss <arun.ramadoss@microchip.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 25 Jul 2022 20:05:54 +0000 (13:05 -0700)]
ip6mr: remove stray rcu_read_unlock() from ip6_mr_forward()
One rcu_read_unlock() should have been removed in blamed commit.
Fixes: 9201eeac94f3 ("ip6mr: do not acquire mrt_lock while calling ip6_mr_forward()") Reported-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20220725200554.2563581-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Mat Martineau [Mon, 25 Jul 2022 20:52:31 +0000 (13:52 -0700)]
mptcp: Do not return EINPROGRESS when subflow creation succeeds
New subflows are created within the kernel using O_NONBLOCK, so
EINPROGRESS is the expected return value from kernel_connect().
__mptcp_subflow_connect() has the correct logic to consider EINPROGRESS
to be a successful case, but it has also used that error code as its
return value.
Before v5.19 this was benign: all the callers ignored the return
value. Starting in v5.19 there is a MPTCP_PM_CMD_SUBFLOW_CREATE generic
netlink command that does use the return value, so the EINPROGRESS gets
propagated to userspace.
Make __mptcp_subflow_connect() always return 0 on success instead.
Fixes: fa7e6d3bc859 ("mptcp: Add handling of outgoing MP_JOIN requests") Fixes: 0c16c3db9b00 ("mptcp: netlink: allow userspace-driven subflow establishment") Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20220725205231.87529-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
1) If nf_queue user requests packet truncation below size of l3 header,
we corrupt the skb, then crash. Reject such requests.
2) add cond_resched() calls when doing cycle detection in the
nf_tables graph. This avoids softlockup warning with certain
rulesets.
3) Reject rulesets that use nftables 'queue' expression in family/chain
combinations other than those that are supported. Currently the ruleset
will load, but when userspace attempts to reinject you get WARN splat +
packet drops.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nft_queue: only allow supported familes and hooks
netfilter: nf_tables: add rescheduling points during loop detection walks
netfilter: nf_queue: do not allow packet truncation below transport header offset
====================
Jakub Kicinski [Wed, 27 Jul 2022 02:48:24 +0000 (19:48 -0700)]
Merge tag 'for-net-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fix early wakeup after suspend
- Fix double free on error
- Fix use-after-free on l2cap_chan_put
* tag 'for-net-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put
Bluetooth: Always set event mask on suspend
Bluetooth: mgmt: Fix double free on error path
====================
Merge tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull misc fixes from Andrew Morton:
"Thirteen hotfixes.
Eight are cc:stable and the remainder are for post-5.18 issues or are
too minor to warrant backporting"
* tag 'mm-hotfixes-stable-2022-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm:
mailmap: update Gao Xiang's email addresses
userfaultfd: provide properly masked address for huge-pages
Revert "ocfs2: mount shared volume without ha stack"
hugetlb: fix memoryleak in hugetlb_mcopy_atomic_pte
fs: sendfile handles O_NONBLOCK of out_fd
ntfs: fix use-after-free in ntfs_ucsncmp()
secretmem: fix unhandled fault in truncate
mm/hugetlb: separate path for hwpoison entry in copy_hugetlb_page_range()
mm: fix missing wake-up event for FSDAX pages
mm: fix page leak with multiple threads mapping the same page
mailmap: update Seth Forshee's email address
tmpfs: fix the issue that the mount and remount results are inconsistent.
mm: kfence: apply kmemleak_ignore_phys on early allocated pool
I've been in Alibaba Cloud for more than one year, mainly to address
cloud-native challenges (such as high-performance container images) for
open source communities.
Update my email addresses on behalf of my current employer (Alibaba Cloud)
to support all my (team) work in this area. Also add an outdated
@redhat.com address of me.
Nadav Amit [Mon, 11 Jul 2022 16:59:06 +0000 (09:59 -0700)]
userfaultfd: provide properly masked address for huge-pages
Commit 56f09d53f382 ("userfaultfd: provide unmasked address on
page-fault") was introduced to fix an old bug, in which the offset in the
address of a page-fault was masked. Concerns were raised - although were
never backed by actual code - that some userspace code might break because
the bug has been around for quite a while. To address these concerns a
new flag was introduced, and only when this flag is set by the user,
userfaultfd provides the exact address of the page-fault.
The commit however had a bug, and if the flag is unset, the offset was
always masked based on a base-page granularity. Yet, for huge-pages, the
behavior prior to the commit was that the address is masked to the
huge-page granulrity.
While there are no reports on real breakage, fix this issue. If the flag
is unset, use the address with the masking that was done before.
Link: https://lkml.kernel.org/r/20220711165906.2682-1-namit@vmware.com Fixes: 56f09d53f382 ("userfaultfd: provide unmasked address on page-fault") Signed-off-by: Nadav Amit <namit@vmware.com> Reported-by: James Houghton <jthoughton@google.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Peter Xu <peterx@redhat.com> Reviewed-by: James Houghton <jthoughton@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Jakub Kicinski [Tue, 26 Jul 2022 21:38:53 +0000 (14:38 -0700)]
Merge branch 'tls-rx-decrypt-from-the-tcp-queue'
Jakub Kicinski says:
====================
tls: rx: decrypt from the TCP queue
This is the final part of my TLS Rx rework. It switches from
strparser to decrypting data from skbs queued in TCP. We don't
need the full strparser for TLS, its needs are very basic.
This set gives us a small but measurable (6%) performance
improvement (continuous stream).
====================
Jakub Kicinski [Fri, 22 Jul 2022 23:50:33 +0000 (16:50 -0700)]
tls: rx: do not use the standard strparser
TLS is a relatively poor fit for strparser. We pause the input
every time a message is received, wait for a read which will
decrypt the message, start the parser, repeat. strparser is
built to delineate the messages, wrap them in individual skbs
and let them float off into the stack or a different socket.
TLS wants the data pages and nothing else. There's no need
for TLS to keep cloning (and occasionally skb_unclone()'ing)
the TCP rx queue.
This patch uses a pre-allocated skb and attaches the skbs
from the TCP rx queue to it as frags. TLS is careful never
to modify the input skb without CoW'ing / detaching it first.
Since we call TCP rx queue cleanup directly we also get back
the benefit of skb deferred free.
Overall this results in a 6% gain in my benchmarks.
Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jakub Kicinski [Fri, 22 Jul 2022 23:50:32 +0000 (16:50 -0700)]
tls: rx: device: add input CoW helper
Wrap the remaining skb_cow_data() into a helper, so it's easier
to replace down the lane. The new version will change the skb
so make sure relevant pointers get reloaded after the call.
Jakub Kicinski [Fri, 22 Jul 2022 23:50:31 +0000 (16:50 -0700)]
tcp: allow tls to decrypt directly from the tcp rcv queue
Expose TCP rx queue accessor and cleanup, so that TLS can
decrypt directly from the TCP queue. The expectation
is that the caller can access the skb returned from
tcp_recv_skb() and up to inq bytes worth of data (some
of which may be in ->next skbs) and then call
tcp_read_done() when data has been consumed.
The socket lock must be held continuously across
those two operations.
Jakub Kicinski [Fri, 22 Jul 2022 23:50:30 +0000 (16:50 -0700)]
tls: rx: device: keep the zero copy status with offload
The non-zero-copy path assumes a full skb with decrypted contents.
This means the device offload would have to CoW the data. Try
to keep the zero-copy status instead, copy the data to user space.
Jakub Kicinski [Fri, 22 Jul 2022 23:50:29 +0000 (16:50 -0700)]
tls: rx: don't free the output in case of zero-copy
In the future we'll want to reuse the input skb in case of
zero-copy so we shouldn't always free darg.skb. Move the
freeing of darg.skb into the non-zc cases. All cases will
now free ctx->recv_pkt (inside let tls_rx_rec_done()).