Peilin Ye [Mon, 8 Aug 2022 18:04:47 +0000 (11:04 -0700)]
vsock: Fix memory leak in vsock_connect()
An O_NONBLOCK vsock_connect() request may try to reschedule
@connect_work. Imagine the following sequence of vsock_connect()
requests:
1. The 1st, non-blocking request schedules @connect_work, which will
expire after 200 jiffies. Socket state is now SS_CONNECTING;
2. Later, the 2nd, blocking request gets interrupted by a signal after
a few jiffies while waiting for the connection to be established.
Socket state is back to SS_UNCONNECTED, but @connect_work is still
pending, and will expire after 100 jiffies.
3. Now, the 3rd, non-blocking request tries to schedule @connect_work
again. Since @connect_work is already scheduled,
schedule_delayed_work() silently returns. sock_hold() is called
twice, but sock_put() will only be called once in
vsock_connect_timeout(), causing a memory leak reported by syzbot:
The usage of FLAG_SEND_ZLP causes problems to other firmware/hardware
versions that have no issues.
The FLAG_SEND_ZLP is not safe to use in this context.
See:
https://patchwork.ozlabs.org/project/netdev/patch/1270599787.8900.8.camel@Linuxdev4-laptop/#118378
The original problem needs another way to solve.
====================
Do not use RT_TOS for IPv6 flowlabel
According to Guillaume Nault RT_TOS should never be used for IPv6.
Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.
But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.
====================
Matthias May [Fri, 5 Aug 2022 19:19:06 +0000 (21:19 +0200)]
ipv6: do not use RT_TOS for IPv6 flowlabel
According to Guillaume Nault RT_TOS should never be used for IPv6.
Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.
But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.
Fixes: ed095d08e6ed ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthias May [Fri, 5 Aug 2022 19:19:05 +0000 (21:19 +0200)]
mlx5: do not use RT_TOS for IPv6 flowlabel
According to Guillaume Nault RT_TOS should never be used for IPv6.
Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.
But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.
Fixes: 9feed98c9969 ("net/mlx5e: Support SRIOV TC encapsulation offloads for IPv6 tunnels") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthias May [Fri, 5 Aug 2022 19:19:04 +0000 (21:19 +0200)]
vxlan: do not use RT_TOS for IPv6 flowlabel
According to Guillaume Nault RT_TOS should never be used for IPv6.
Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.
But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.
Fixes: cbd781486edd ("vxlan: allow setting ipv6 traffic class") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthias May [Fri, 5 Aug 2022 19:19:03 +0000 (21:19 +0200)]
geneve: do not use RT_TOS for IPv6 flowlabel
According to Guillaume Nault RT_TOS should never be used for IPv6.
Quote:
RT_TOS() is an old macro used to interprete IPv4 TOS as described in
the obsolete RFC 1349. It's conceptually wrong to use it even in IPv4
code, although, given the current state of the code, most of the
existing calls have no consequence.
But using RT_TOS() in IPv6 code is always a bug: IPv6 never had a "TOS"
field to be interpreted the RFC 1349 way. There's no historical
compatibility to worry about.
Fixes: 9373d053215c ("geneve: handle ipv6 priority like ipv4 tos") Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Matthias May <matthias.may@westermo.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Matthias May [Fri, 5 Aug 2022 19:00:06 +0000 (21:00 +0200)]
geneve: fix TOS inheriting for ipv4
The current code retrieves the TOS field after the lookup
on the ipv4 routing table. The routing process currently
only allows routing based on the original 3 TOS bits, and
not on the full 6 DSCP bits.
As a result the retrieved TOS is cut to the 3 bits.
However for inheriting purposes the full 6 bits should be used.
Extract the full 6 bits before the route lookup and use
that instead of the cut off 3 TOS bits.
Fixes: 3d7c024c7bb5 ("geneve: Add support to collect tunnel metadata.") Signed-off-by: Matthias May <matthias.may@westermo.com> Acked-by: Guillaume Nault <gnault@redhat.com> Link: https://lore.kernel.org/r/20220805190006.8078-1-matthias.may@westermo.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: atlantic: fix aq_vec index out of range error
The final update statement of the for loop exceeds the array range, the
dereference of self->aq_vec[i] is not checked and then leads to the
index out of range error.
Also fixed this kind of coding style in other for loop.
The following patchset contains Netfilter fixes for net:
1) Harden set element field checks to avoid out-of-bound memory access,
this patch also fixes the type of issue described in f9b73cd73839
("netfilter: nf_tables: stricter validation of element data") in a
broader way.
2) Patches to restrict the chain, set, and rule id lookup in the
transaction to the corresponding top-level table, patches from
Thadeu Lima de Souza Cascardo.
3) Fix incorrect comment in ip6t_LOG.h
4) nft_data_init() performs upfront validation of the expected data.
struct nft_data_desc is used to describe the expected data to be
received from userspace. The .size field represents the maximum size
that can be stored, for bound checks. Then, .len is an input/output field
which stores the expected length as input (this is optional, to restrict
the checks), as output it stores the real length received from userspace
(if it was not specified as input). This patch comes in response to f9b73cd73839 ("netfilter: nf_tables: stricter validation of element data")
to address this type of issue in a more generic way by avoid opencoded
data validation. Next patch requires this as a dependency.
5) Disallow jump to implicit chain from set element, this configuration
is invalid. Only allow jump to chain via immediate expression is
supported at this stage.
6) Fix possible null-pointer derefence in the error path of table updates,
if memory allocation of the transaction fails. From Florian Westphal.
* git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
netfilter: nf_tables: fix null deref due to zeroed list head
netfilter: nf_tables: disallow jump to implicit chain from set element
netfilter: nf_tables: upfront validation of data via nft_data_init()
netfilter: ip6t_LOG: Fix a typo in a comment
netfilter: nf_tables: do not allow RULE_ID to refer to another chain
netfilter: nf_tables: do not allow CHAIN_ID to refer to another table
netfilter: nf_tables: do not allow SET_ID to refer to another table
netfilter: nf_tables: validate variable length element extension
====================
Sebastian Würl [Thu, 4 Aug 2022 08:14:11 +0000 (10:14 +0200)]
can: mcp251x: Fix race condition on receive interrupt
The mcp251x driver uses both receiving mailboxes of the CAN controller
chips. For retrieving the CAN frames from the controller via SPI, it checks
once per interrupt which mailboxes have been filled and will retrieve the
messages accordingly.
This introduces a race condition, as another CAN frame can enter mailbox 1
while mailbox 0 is emptied. If now another CAN frame enters mailbox 0 until
the interrupt handler is called next, mailbox 0 is emptied before
mailbox 1, leading to out-of-order CAN frames in the network device.
This is fixed by checking the interrupt flags once again after freeing
mailbox 0, to correctly also empty mailbox 1 before leaving the handler.
For reproducing the bug I created the following setup:
- Two CAN devices, one Raspberry Pi with MCP2515, the other can be any.
- Setup CAN to 1 MHz
- Spam bursts of 5 CAN-messages with increasing CAN-ids
- Continue sending the bursts while sleeping a second between the bursts
- Check on the RPi whether the received messages have increasing CAN-ids
- Without this patch, every burst of messages will contain a flipped pair
The issue seems similar to commit 43dcd88433ef ("net: hisilicon: Fix a BUG
trigered by wrong bytes_compl") and potentially introduced by commit 033760492f70 ("bgmac: simplify tx ring index handling").
If there is an RX interrupt between setting ring->end
and netdev_sent_queue() we can hit the BUG_ON as bgmac_dma_tx_free()
can miscalculate the queue size while called from bgmac_poll().
The machine which triggered the BUG runs a v4.14 RT kernel - but the issue
seems present in mainline too.
Fixes: 033760492f70 ("bgmac: simplify tx ring index handling") Signed-off-by: Sandor Bodo-Merle <sbodomerle@gmail.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220808173939.193804-1-sbodomerle@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Mon, 8 Aug 2022 12:51:27 +0000 (15:51 +0300)]
net: dsa: felix: suppress non-changes to the tagging protocol
The way in which dsa_tree_change_tag_proto() works is that when
dsa_tree_notify() fails, it doesn't know whether the operation failed
mid way in a multi-switch tree, or it failed for a single-switch tree.
So even though drivers need to fail cleanly in
ds->ops->change_tag_protocol(), DSA will still call dsa_tree_notify()
again, to restore the old tag protocol for potential switches in the
tree where the change did succeeed (before failing for others).
This means for the felix driver that if we report an error in
felix_change_tag_protocol(), we'll get another call where proto_ops ==
old_proto_ops. If we proceed to act upon that, we may do unexpected
things. For example, we will call dsa_tag_8021q_register() twice in a
row, without any dsa_tag_8021q_unregister() in between. Then we will
actually call dsa_tag_8021q_unregister() via old_proto_ops->teardown,
which (if it manages to run at all, after walking through corrupted data
structures) will leave the ports inoperational anyway.
The bug can be readily reproduced if we force an error while in
tag_8021q mode; this crashes the kernel.
netfilter: nf_tables: fix null deref due to zeroed list head
In nf_tables_updtable, if nf_tables_table_enable returns an error,
nft_trans_destroy is called to free the transaction object.
nft_trans_destroy() calls list_del(), but the transaction was never
placed on a list -- the list head is all zeroes, this results in
a null dereference:
BUG: KASAN: null-ptr-deref in nft_trans_destroy+0x26/0x59
Call Trace:
nft_trans_destroy+0x26/0x59
nf_tables_newtable+0x4bc/0x9bc
[..]
Its sane to assume that nft_trans_destroy() can be called
on the transaction object returned by nft_trans_alloc(), so
make sure the list head is initialised.
Fixes: 24afa2cc4521 ("netfilter: nf_tables: use new transaction infrastructure to handle table") Reported-by: mingi cho <mgcho.minic@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: upfront validation of data via nft_data_init()
Instead of parsing the data and then validate that type and length are
correct, pass a description of the expected data so it can be validated
upfront before parsing it to bail out earlier.
This patch adds a new .size field to specify the maximum size of the
data area. The .len field is optional and it is used as an input/output
field, it provides the specific length of the expected data in the input
path. If then .len field is not specified, then obtained length from the
netlink attribute is stored. This is required by cmp, bitwise, range and
immediate, which provide no netlink attribute that describes the data
length. The immediate expression uses the destination register type to
infer the expected data type.
Relying on opencoded validation of the expected data might lead to
subtle bugs as described in f9b73cd73839 ("netfilter: nf_tables:
stricter validation of element data").
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: do not allow RULE_ID to refer to another chain
When doing lookups for rules on the same batch by using its ID, a rule from
a different chain can be used. If a rule is added to a chain but tries to
be positioned next to a rule from a different chain, it will be linked to
chain2, but the use counter on chain1 would be the one to be incremented.
When looking for rules by ID, use the chain that was used for the lookup by
name. The chain used in the context copied to the transaction needs to
match that same chain. That way, struct nft_rule does not need to get
enlarged with another member.
Fixes: 923354fe5352 ("netfilter: nf_tables: add NFTA_RULE_ID attribute") Fixes: 3c095cb5288d ("netfilter: nf_tables: Support RULE_ID reference in new rule") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: do not allow CHAIN_ID to refer to another table
When doing lookups for chains on the same batch by using its ID, a chain
from a different table can be used. If a rule is added to a table but
refers to a chain in a different table, it will be linked to the chain in
table2, but would have expressions referring to objects in table1.
Then, when table1 is removed, the rule will not be removed as its linked to
a chain in table2. When expressions in the rule are processed or removed,
that will lead to a use-after-free.
When looking for chains by ID, use the table that was used for the lookup
by name, and only return chains belonging to that same table.
Fixes: 40929d8e631b ("netfilter: nf_tables: add NFTA_RULE_CHAIN_ID attribute") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: do not allow SET_ID to refer to another table
When doing lookups for sets on the same batch by using its ID, a set from a
different table can be used.
Then, when the table is removed, a reference to the set may be kept after
the set is freed, leading to a potential use-after-free.
When looking for sets by ID, use the table that was used for the lookup by
name, and only return sets belonging to that same table.
This fixes CVE-2022-2586, also reported as ZDI-CAN-17470.
Reported-by: Team Orca of Sea Security (@seasecresponse) Fixes: 8406eef378c0 ("netfilter: nf_tables: use new transaction infrastructure to handle sets") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com> Cc: <stable@vger.kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: validate variable length element extension
Update template to validate variable length extensions. This patch adds
a new .ext_len[id] field to the template to store the expected extension
length. This is used to sanity check the initialization of the variable
length extension.
Use PTR_ERR() in nft_set_elem_init() to report errors since, after this
update, there are two reason why this might fail, either because of
ENOMEM or insufficient room in the extension field (EINVAL).
Kernels up until f9b73cd73839 ("netfilter: nf_tables: stricter
validation of element data") allowed to copy more data to the extension
than was allocated. This ext_len field allows to validate if the
destination has the correct size as additional check.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
clang emits a -Wunaligned-access warning on struct __packed
ems_cpc_msg.
The reason is that the anonymous union msg (not declared as packed) is
being packed right after some non naturally aligned variables (3*8
bits + 2*32) inside a packed struct:
| struct __packed ems_cpc_msg {
| u8 type; /* type of message */
| u8 length; /* length of data within union 'msg' */
| u8 msgid; /* confirmation handle */
| __le32 ts_sec; /* timestamp in seconds */
| __le32 ts_nsec; /* timestamp in nano seconds */
| /* ^ not naturally aligned */
|
| union {
| /* ^ not declared as packed */
| u8 generic[64];
| struct cpc_can_msg can_msg;
| struct cpc_can_params can_params;
| struct cpc_confirm confirmation;
| struct cpc_overrun overrun;
| struct cpc_can_error error;
| struct cpc_can_err_counter err_counter;
| u8 can_state;
| } msg;
| };
Starting from LLVM 14, having an unpacked struct nested in a packed
struct triggers a warning. c.f. [1].
Fix the warning by marking the anonymous union as packed.
Fixes: 07edeffafb68 ("ems_usb: Added support for EMS CPC-USB/ARM7 CAN/USB interface") Link: https://lore.kernel.org/all/20220802094021.959858-1-mkl@pengutronix.de Cc: Gerhard Uttenthaler <uttenthaler@ems-wuensche.com> Cc: Sebastian Haas <haas@ems-wuensche.com> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Fedor Pchelkin [Fri, 5 Aug 2022 15:02:16 +0000 (18:02 +0300)]
can: j1939: j1939_session_destroy(): fix memory leak of skbs
We need to drop skb references taken in j1939_session_skb_queue() when
destroying a session in j1939_session_destroy(). Otherwise those skbs
would be lost.
Link to Syzkaller info and repro: https://forge.ispras.ru/issues/11743.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
can: j1939: j1939_sk_queue_activate_next_locked(): replace WARN_ON_ONCE with netdev_warn_once()
We should warn user-space that it is doing something wrong when trying
to activate sessions with identical parameters but WARN_ON_ONCE macro
can not be used here as it serves a different purpose.
So it would be good to replace it with netdev_warn_once() message.
Found by Linux Verification Center (linuxtesting.org) with Syzkaller.
Fixes: c4da2dabb66d ("can: add support of SAE J1939 protocol") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/all/20220729143655.1108297-1-pchelkin@ispras.ru
[mkl: fix indention] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
Jakub Kicinski [Tue, 9 Aug 2022 03:59:07 +0000 (20:59 -0700)]
Merge tag 'for-net-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:
====================
bluetooth pull request for net:
- Fixes various issues related to ISO channel/socket support
- Fixes issues when building with C=1
- Fix cancel uninitilized work which blocks syzbot to run
* tag 'for-net-2022-08-08' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
Bluetooth: ISO: Fix not using the correct QoS
Bluetooth: don't try to cancel uninitialized works at mgmt_index_removed()
Bluetooth: ISO: Fix iso_sock_getsockopt for BT_DEFER_SETUP
Bluetooth: MGMT: Fixes build warnings with C=1
Bluetooth: hci_event: Fix build warning with C=1
Bluetooth: ISO: Fix memory corruption
Bluetooth: Fix null pointer deref on unexpected status event
Bluetooth: ISO: Fix info leak in iso_sock_getsockopt()
Bluetooth: hci_conn: Fix updating ISO QoS PHY
Bluetooth: ISO: unlock on error path in iso_sock_setsockopt()
Bluetooth: L2CAP: Fix l2cap_global_chan_by_psm regression
====================
Since
commit 37ec61305ceb ("s390/qeth: detach netdevice while card is offline")
there was a timing window during recovery, that qeth_query_card_info could
be sent to the card, even before it was ready for it, leading to a failing
card recovery. There is evidence that this window was hit, as not all
callers of get_link_ksettings() check for netif_device_present.
Use cached values in qeth_get_link_ksettings(), instead of calling
qeth_query_card_info() and falling back to default values in case it
fails. Link info is already updated when the card goes online, e.g. after
STARTLAN (physical link up). Set the link info to default values, when the
card goes offline or at STOPLAN (physical link down). A follow-on patch
will improve values reported for link down.
Fixes: 37ec61305ceb ("s390/qeth: detach netdevice while card is offline") Signed-off-by: Alexandra Winter <wintera@linux.ibm.com> Reviewed-by: Thorsten Winkler <twinkler@linux.ibm.com> Link: https://lore.kernel.org/r/20220805155714.59609-1-wintera@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Nikita Shubin [Fri, 5 Aug 2022 08:48:43 +0000 (11:48 +0300)]
net: phy: dp83867: fix get nvmem cell fail
If CONFIG_NVMEM is not set of_nvmem_cell_get, of_nvmem_device_get
functions will return ERR_PTR(-EOPNOTSUPP) and "failed to get nvmem
cell io_impedance_ctrl" error would be reported despite "io_impedance_ctrl"
is completely missing in Device Tree and we should use default values.
Check -EOPNOTSUPP togather with -ENOENT to avoid this situation.
Fixes: 825ec515cb5b ("net: phy: dp83867: implement support for io_impedance_ctrl nvmem cell") Signed-off-by: Nikita Shubin <n.shubin@yadro.com> Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://lore.kernel.org/r/20220805084843.24542-1-nikita.shubin@maquefel.me Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Duoming Zhou [Fri, 5 Aug 2022 07:00:08 +0000 (15:00 +0800)]
atm: idt77252: fix use-after-free bugs caused by tst_timer
There are use-after-free bugs caused by tst_timer. The root cause
is that there are no functions to stop tst_timer in idt77252_exit().
One of the possible race conditions is shown below:
min_gate_len[0] and min_gate_len[1] should be 200000, while
min_gate_len[2-7] should be 0.
However what happens is that min_gate_len[0] is 200000, but
min_gate_len[1] ends up being 0 (despite gate_len[1] being 200000 at the
point where the logic detects the gate close event for TC 1).
The problem is that the code considers a "gate close" event whenever it
sees that there is a 0 for that TC (essentially it's level rather than
edge triggered). By doing that, any time a gate is seen as closed
without having been open prior, gate_len, which is 0, will be written
into min_gate_len. Once min_gate_len becomes 0, it's impossible for it
to track anything higher than that (the length of actually open
intervals).
To fix this, we make the writing to min_gate_len[tc] be edge-triggered,
which avoids writes for gates that are closed in consecutive intervals.
However what this does is it makes us need to special-case the
permanently closed gates at the end.
Martin Schiller [Fri, 5 Aug 2022 06:18:10 +0000 (08:18 +0200)]
net/x25: fix call timeouts in blocking connects
When a userspace application starts a blocking connect(), a CALL REQUEST
is sent, the t21 timer is started and the connect is waiting in
x25_wait_for_connection_establishment(). If then for some reason the t21
timer expires before any reaction on the assigned logical channel (e.g.
CALL ACCEPT, CLEAR REQUEST), there is sent a CLEAR REQUEST and timer
t23 is started waiting for a CLEAR confirmation. If we now receive a
CLEAR CONFIRMATION from the peer, x25_disconnect() is called in
x25_state2_machine() with reason "0", which means "normal" call
clearing. This is ok, but the parameter "reason" is used as sk->sk_err
in x25_disconnect() and sock_error(sk) is evaluated in
x25_wait_for_connection_establishment() to check if the call is still
pending. As "0" is not rated as an error, the connect will stuck here
forever.
To fix this situation, also check if the sk->sk_state changed form
TCP_SYN_SENT to TCP_CLOSE in the meantime, which is also done by
x25_disconnect().
If tsnep_tx_map() fails, then tsnep_tx_unmap() shall start at the write
index like tsnep_tx_map(). This is different to the normal operation.
Thus, add an additional parameter to tsnep_tx_unmap() to enable start at
different positions for successful TX and failed TX.
Fixes: ed2d3a71e548 ("tsnep: Add TSN endpoint Ethernet MAC driver") Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
drivers/net/ethernet/engleder/tsnep_main.c:1254:34: warning:
'tsnep_of_match' defined but not used [-Wunused-const-variable=]
of_match_ptr() compiles into NULL if CONFIG_OF is disabled.
tsnep_of_match exists always so use of of_match_ptr() is useless.
Fix warning by dropping of_match_ptr().
Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Gerhard Engleder <gerhard@engleder-embedded.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This fixes using wrong QoS settings when attempting to send frames while
acting as peripheral since the QoS settings in use are stored in
hconn->iso_qos not in sk->qos, this is actually properly handled on
getsockopt(BT_ISO_QOS) but not on iso_send_frame.
Fixes: e8b1014d0e46d ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Tetsuo Handa [Fri, 5 Aug 2022 07:12:18 +0000 (16:12 +0900)]
Bluetooth: don't try to cancel uninitialized works at mgmt_index_removed()
syzbot is reporting attempt to cancel uninitialized work at
mgmt_index_removed() [1], for calling cancel_delayed_work_sync() without
INIT_DELAYED_WORK() is not permitted.
INIT_DELAYED_WORK() is called from mgmt_init_hdev() via chan->hdev_init()
from hci_mgmt_cmd(), but cancel_delayed_work_sync() is unconditionally
called from mgmt_index_removed().
Call cancel_delayed_work_sync() only if HCI_MGMT flag was set, for
mgmt_init_hdev() sets HCI_MGMT flag when calling INIT_DELAYED_WORK().
Link: https://syzkaller.appspot.com/bug?extid=b8ddd338a8838e581b1c Reported-by: syzbot <syzbot+b8ddd338a8838e581b1c@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Fixes: 0f1d081cd2f4f73d ("Bluetooth: Convert delayed discov_off to hci_sync") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: ISO: Fix iso_sock_getsockopt for BT_DEFER_SETUP
BT_DEFER_SETUP shall be considered valid for all states except for
BT_CONNECTED as it is also used when initiated a connection rather then
only for BT_BOUND and BT_LISTEN.
Fixes: e8b1014d0e46d ("Bluetooth: Add BTPROTO_ISO socket type") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
This fixes the following warning when build with make C=1:
net/bluetooth/hci_event.c:337:15: warning: restricted __le16 degrades to integer
Fixes: ccf02a8169bd6 ("Bluetooth: Process result of HCI Delete Stored Link Key command") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: Fix null pointer deref on unexpected status event
__hci_cmd_sync returns NULL if the controller responds with a status
event. This is unexpected for the commands sent here, but on
occurrence leads to null pointer dereferences and thus must be
handled.
Signed-off-by: Soenke Huster <soenke.huster@eknoes.de> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Bluetooth: ISO: Fix info leak in iso_sock_getsockopt()
The C standard rules for when struct holes are zeroed out are slightly
weird. The existing assignments might initialize everything, but GCC
is allowed to (and does sometimes) leave the struct holes uninitialized,
so instead of using yet another variable and copy the QoS settings just
use a pointer to the stored QoS settings.
Fixes: e8b1014d0e46d ("Bluetooth: Add BTPROTO_ISO socket type") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
BT_ISO_QOS has different semantics when it comes to QoS PHY as it uses
0x00 to disable a direction but that value is invalid over HCI and
sockets using DEFER_SETUP to connect may attempt to use hci_bind_cis
multiple times in order to detect if the parameters have changed, so to
fix the code will now just mirror the PHY for the parameters of
HCI_OP_LE_SET_CIG_PARAMS and will not update the PHY of the socket
leaving it disabled.
Fixes: 235819527e62f ("Bluetooth: Add initial implementation of CIS connections") Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
The patch 5c507b737a50: "Bluetooth: L2CAP: Fix use-after-free caused
by l2cap_chan_put" from Jul 21, 2022, leads to the following Smatch
static checker warning:
net/bluetooth/l2cap_core.c:1977 l2cap_global_chan_by_psm()
error: we previously assumed 'c' could be null (see line 1996)
Fixes: 5c507b737a50 ("Bluetooth: L2CAP: Fix use-after-free caused by l2cap_chan_put") Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Gao Feng [Thu, 4 Aug 2022 15:04:21 +0000 (23:04 +0800)]
net: bpf: Use the protocol's set_rcvlowat behavior if there is one
The commit 96ba1ba9c27a ("tcp: fix SO_RCVLOWAT and RCVBUF autotuning")
add one new (struct proto_ops)->set_rcvlowat method so that a protocol
can override the default setsockopt(SO_RCVLOWAT) behavior.
The prior bpf codes don't check and invoke the protos's set_rcvlowat,
now correct it.
Signed-off-by: Gao Feng <gfree.wind@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xuan Zhuo [Thu, 4 Aug 2022 06:32:48 +0000 (14:32 +0800)]
virtio_net: fix memory leak inside XPD_TX with mergeable
When we call xdp_convert_buff_to_frame() to get xdpf, if it returns
NULL, we should check if xdp_page was allocated by xdp_linearize_page().
If it is newly allocated, it should be freed here alone. Just like any
other "goto err_xdp".
Fixes: 42440ded4588 ("xdp: transition into using xdp_frame for ndo_xdp_xmit") Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
wifi: cfg80211: Fix validating BSS pointers in __cfg80211_connect_result
Driver's SME is allowed to fill either BSSID or BSS pointers in struct
cfg80211_connect_resp_params when indicating connect response but a
check in __cfg80211_connect_result() is giving unnecessary warning when
driver's SME fills only BSSID pointer and not BSS pointer in struct
cfg80211_connect_resp_params.
In case of mac80211 with auth/assoc path, it is always expected to fill
BSS pointers in struct cfg80211_connect_resp_params when calling
__cfg80211_connect_result() since cfg80211 must have hold BSS pointers
in cfg80211_mlme_assoc().
So, skip the check for the drivers which support cfg80211 connect
callback, for example with brcmfmac is one such driver which had the
warning:
WARNING: CPU: 5 PID: 514 at net/wireless/sme.c:786 __cfg80211_connect_result+0x2fc/0x5c0 [cfg80211]
Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 7e55b6101cd4 ("cfg80211: Indicate MLO connection info in connect and roam callbacks") Signed-off-by: Veerendranath Jakkam <quic_vjakkam@quicinc.com>
[kvalo@kernel.org: add more info to the commit log] Signed-off-by: Kalle Valo <kvalo@kernel.org> Link: https://lore.kernel.org/r/20220805135259.4126630-1-quic_vjakkam@quicinc.com
net: seg6: initialize induction variable to first valid array index
Fixes the following warnings observed when building
CONFIG_IPV6_SEG6_LWTUNNEL=y with clang:
net/ipv6/seg6_local.o: warning: objtool: seg6_local_fill_encap() falls
through to next function seg6_local_get_encap_size()
net/ipv6/seg6_local.o: warning: objtool: seg6_local_cmp_encap() falls
through to next function input_action_end()
LLVM can fully unroll loops in seg6_local_get_encap_size() and
seg6_local_cmp_encap(). One issue in those loops is that the induction
variable is initialized to 0. The loop iterates over members of
seg6_action_params, a global array of struct seg6_action_param calling
their put() function pointer members. seg6_action_param uses an array
initializer to initialize SEG6_LOCAL_SRH and later elements, which is
the third enumeration of an anonymous union.
The guard `if (attrs & SEG6_F_ATTR(i))` may prevent this from being
called at runtime, but it would still be UB for
`seg6_action_params[0]->put` to be called; the unrolled loop will make
the initial iterations unreachable, which LLVM will later rotate to
fallthrough to the next function.
Make this more obvious that this cannot happen to the compiler by
initializing the loop induction variable to the minimum valid index that
seg6_action_params is initialized to.
net: bcmgenet: Indicate MAC is in charge of PHY PM
Avoid the PHY library call unnecessarily into the suspend/resume functions by
setting phydev->mac_managed_pm to true. The GENET driver essentially does
exactly what mdio_bus_phy_resume() does by calling phy_init_hw() plus
phy_resume().
Francois Romieu [Tue, 2 Aug 2022 15:07:42 +0000 (17:07 +0200)]
net: avoid overflow when rose /proc displays timer information.
rose /proc code does not serialize timer accesses.
Initial report by Bernard F6BVP Pidoux exhibits overflow amounting
to 116 ticks on its HZ=250 system.
Full timer access serialization would imho be overkill as rose /proc
does not enforce consistency between displayed ROSE_STATE_XYZ and
timer values during changes of state.
The patch may also fix similar behavior in ax25 /proc, ax25 ioctl
and netrom /proc as they all exhibit the same timer serialization
policy. This point has not been reported though.
The sole remaining use of ax25_display_timer - ax25 rtt valuation -
may also perform marginally better but I have not analyzed it too
deeply.
For packets scheduled to RPM and LBK, NIX_AF_PSE_CHANNEL_LEVEL[BP_LEVEL]
selects the TL3 or TL2 scheduling level as the one used for link/channel
selection and backpressure. For each scheduling queue at the selected
level: Setting NIX_AF_TL3_TL2(0..255)_LINK(0..12)_CFG[ENA] = 1 allows
the TL3/TL2 queue to schedule packets to a specified RPM or LBK link
and channel.
There is an issue in the code where NIX_AF_PSE_CHANNEL_LEVEL[BP_LEVEL]
is set to TL3 where as the NIX_AF_TL3_TL2(0..255)_LINK(0..12)_CFG is
configured for TL2 queue in some cases. As a result packets will not
transmit on that link/channel. This patch fixes the issue by configuring
the NIX_AF_TL3_TL2(0..255)_LINK(0..12)_CFG register depending on the
NIX_AF_PSE_CHANNEL_LEVEL[BP_LEVEL] value.
Jakub Kicinski [Sat, 6 Aug 2022 01:56:36 +0000 (18:56 -0700)]
Merge branch 'octeontx2-af-driver-fixes-for-npc'
Subbaraya Sundeep says:
====================
Octeontx2 AF driver fixes for NPC
This patchset includes AF driver fixes wrt packet parser NPC.
Following are the changes:
Patch 1: The parser nibble configuration must be same for
TX and RX interfaces and if not fix up is applied. This fixup was
applied only for default profile currently and it has been fixed
to apply for all profiles.
Patch 2: Firmware image may not be present all times in the kernel image
and default profile is used mostly hence suppress the warning.
Patch 3: This patch fixes a corner case where NIXLF is detached but
without freeing its mcam entries which results in resource leak.
Patch 4: SMAC is overlapped with DMAC mistakenly while installing
rules based on SMAC. This patch fixes that.
====================
Given a field with its location/offset in input packet,
the key checking logic verifies whether extracting the
field can be supported or not based on the mkex profile
loaded in hardware. This logic is wrong wrt source mac
and this patch fixes that.
Fixes: 5e3b91a29171 ("octeontx2-af: Generate key field bit mask from KEX profile") Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The teardown sequence in FLR handler returns if no NIX LF
is attached to PF/VF because it indicates that graceful
shutdown of resources already happened. But there is a
chance of all allocated MCAM entries not being freed by
PF/VF. Hence free mcam entries even in case of detached LF.
The packet parser profile supplied as firmware may not
be present all the time and default profile is used mostly.
Hence suppress firmware loading warning from kernel due to
absence of firmware in kernel image.
Fixes: dee4027c5b91 ("octeontx2-af: add support for custom KPU entries") Signed-off-by: Harman Kalra <hkalra@marvell.com> Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com> Signed-off-by: Sunil Goutham <sgoutham@marvell.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
NPC_PARSE_NIBBLE for TX interface has to be equal to the RX one for some
silicon revisions. Mistakenly this fixup was only applied to the default
MKEX profile while it should also be applied to any loaded profile.
selftests: netfilter: add test case for nf trace infrastructure
Enable/disable tracing infrastructure while packets are in-flight.
This triggers KASAN splat after 7a71b8c6e1bb ("netfilter: nf_tables: avoid skb access on nf_stolen").
While at it, reduce script run time as well.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Cezar Bulinaru [Wed, 3 Aug 2022 06:28:16 +0000 (02:28 -0400)]
selftests: add few test cases for tap driver
Few test cases related to the fix for 87b674ac3159:
"net: check if protocol extracted by virtio_net_hdr_set_proto is correct"
Need test for the case when a non-standard packet (GSO without NEEDS_CSUM)
sent to the tap device causes a BUG check in the tap driver.
Signed-off-by: Cezar Bulinaru <cbulinaru@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Cezar Bulinaru [Wed, 3 Aug 2022 06:27:59 +0000 (02:27 -0400)]
net: tap: NULL pointer derefence in dev_parse_header_protocol when skb->dev is null
Fixes a NULL pointer derefence bug triggered from tap driver.
When tap_get_user calls virtio_net_hdr_to_skb the skb->dev is null
(in tap.c skb->dev is set after the call to virtio_net_hdr_to_skb)
virtio_net_hdr_to_skb calls dev_parse_header_protocol which
needs skb->dev field to be valid.
The line that trigers the bug is in dev_parse_header_protocol
(dev is at offset 0x10 from skb and is stored in RAX register)
if (!dev->header_ops || !dev->header_ops->parse_protocol)
22e1: mov 0x10(%rbx),%rax
22e5: mov 0x230(%rax),%rax
Setting skb->dev before the call in tap.c fixes the issue.
Fixes: 87b674ac3159 ("net: check if protocol extracted by virtio_net_hdr_set_proto is correct") Signed-off-by: Cezar Bulinaru <cbulinaru@gmail.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When the selftest got added, sendfile() on mptcp sockets returned
-EOPNOTSUPP, so running 'mptcp_connect.sh -m sendfile' failed
immediately.
This is no longer the case, but the script fails anyway due to timeout.
Let the receiver know once the sender has sent all data, just like
with '-m mmap' mode.
v2: need to respect cfg_wait too, as pm_userspace.sh relied
on -m sendfile to keep the connection open (Mat Martineau)
Fixes: 40e38737120b ("mptcp: add basic kselftest for mptcp") Reported-by: Xiumei Mu <xmu@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The root cause of the problem is that an mptcp-level (re)transmit can
race with mptcp_close() and the packet scheduler checks the subflow
state before acquiring the socket lock: we can try to (re)transmit on
an already closed ssk.
Fix the issue checking again the subflow socket status under the
subflow socket lock protection. Additionally add the missing check
for the fallback-to-tcp case.
Fixes: 9845d6ff0fbf ("mptcp: allow picking different xmit subflows") Reported-by: Dipanjan Das <mail.dipanjan.das@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Paolo Abeni [Fri, 5 Aug 2022 00:21:25 +0000 (17:21 -0700)]
mptcp: move subflow cleanup in mptcp_destroy_common()
If the mptcp socket creation fails due to a CGROUP_INET_SOCK_CREATE
eBPF program, the MPTCP protocol ends-up leaking all the subflows:
the related cleanup happens in __mptcp_destroy_sock() that is not
invoked in such code path.
Address the issue moving the subflow sockets cleanup in the
mptcp_destroy_common() helper, which is invoked in every msk cleanup
path.
Additionally get rid of the intermediate list_splice_init step, which
is an unneeded relic from the past.
The issue is present since before the reported root cause commit, but
any attempt to backport the fix before that hash will require a complete
rewrite.
Fixes: 6399a9019d ("mptcp: refactor shutdown and close") Reported-by: Nguyen Dinh Phi <phind.uet@gmail.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Co-developed-by: Nguyen Dinh Phi <phind.uet@gmail.com> Signed-off-by: Nguyen Dinh Phi <phind.uet@gmail.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yu Xiao [Tue, 2 Aug 2022 09:33:55 +0000 (10:33 +0100)]
nfp: ethtool: fix the display error of `ethtool -m DEVNAME`
The port flag isn't set to `NFP_PORT_CHANGED` when using
`ethtool -m DEVNAME` before, so the port state (e.g. interface)
cannot be updated. Therefore, it caused that `ethtool -m DEVNAME`
sometimes cannot read the correct information.
E.g. `ethtool -m DEVNAME` cannot work when load driver before plug
in optical module, as the port interface is still NONE without port
update.
Now update the port state before sending info to NIC to ensure that
port interface is correct (latest state).
Fixes: c35c5cc1c0fe ("nfp: implement ethtool get module EEPROM") Reviewed-by: Louis Peens <louis.peens@corigine.com> Signed-off-by: Yu Xiao <yu.xiao@corigine.com> Signed-off-by: Simon Horman <simon.horman@corigine.com> Link: https://lore.kernel.org/r/20220802093355.69065-1-simon.horman@corigine.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: phy: Warn about incorrect mdio_bus_phy_resume() state
Calling mdio_bus_phy_resume() with neither the PHY state machine set to
PHY_HALTED nor phydev->mac_managed_pm set to true is a good indication
that we can produce a race condition looking like this:
with the phy_resume() function triggering a PHY behavior that might have
to be worked around with (see c83fad30555a ("net: phy: broadcom: Fix
brcm_fet_config_init()") for instance) that ultimately leads to an error
reading from the PHY.
====================
Make DSA work with bonding's ARP monitor
Since commit 38eab6e9ad58 ("net: dsa: declare lockless TX feature for
slave ports") in v5.7, DSA breaks the ARP monitoring logic from the
bonding driver, fact which was pointed out by Brian Hutchinson who uses
a linux-5.10.y stable kernel.
Initially I got lured by other similar hacks introduced for other
NETIF_F_LLTX drivers, which, inspired by the bonding documentation,
update the trans_start of their TX queues by hand.
However Jakub pointed out that this simply isn't a proper solution, and
after coming to think more about it, I agree, and it doesn't work
properly with DSA nor is it maintainable for the future changes I plan
for it (multiple DSA masters in a LAG).
I've tested these changes using a DSA-based setup and a veth-based
setup, using the active-backup mode and ARP monitoring, with and without
arp_validate.
Link to v1:
https://patchwork.kernel.org/project/netdevbpf/patch/20220715232641.952532-1-vladimir.oltean@nxp.com/
Link to v2:
https://patchwork.kernel.org/project/netdevbpf/patch/20220727152000.3616086-1-vladimir.oltean@nxp.com/
====================
Vladimir Oltean [Sun, 31 Jul 2022 12:41:07 +0000 (15:41 +0300)]
Revert "veth: Add updating of trans_start"
This reverts commit ca626fa5ed4166b134d2545ead4cf00b44f1a553. The veth
driver no longer needs these hacks which are slightly detrimential to
the fast path performance, because the bonding driver is keeping track
of TX times of ARP and NS probes by itself, which it should.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Sun, 31 Jul 2022 12:41:06 +0000 (15:41 +0300)]
net/sched: remove hacks added to dev_trans_start() for bonding to work
Now that the bonding driver keeps track of the last TX time of ARP and
NS probes, we effectively revert the following commits:
ce18080c2bc7 ("net_sched: use macvlan real dev trans_start in dev_trans_start()") 6bba1480f7e9 ("net_sched: make dev_trans_start return vlan's real dev trans_start")
Note that the approach of continuing to hack at this function would not
get us very far, hence the desire to take a different approach. DSA is
also a virtual device that uses NETIF_F_LLTX, but there, many uppers
share the same lower (DSA master, i.e. the physical host port of a
switch). By making dev_trans_start() on a DSA interface return the
dev_trans_start() of the master, we effectively assume that all other
DSA interfaces are silent, otherwise this corrupts the validity of the
probe timestamp data from the bonding driver's perspective.
Furthermore, the hacks didn't take into consideration the fact that the
lower interface of @dev may not have been physical either. For example,
VLAN over VLAN, or DSA with 2 masters in a LAG.
And even furthermore, there are NETIF_F_LLTX devices which are not
stacked, like veth. The hack here would not work with those, because it
would not have to provide the bonding driver something to chew at all.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Sun, 31 Jul 2022 12:41:05 +0000 (15:41 +0300)]
net: bonding: replace dev_trans_start() with the jiffies of the last ARP/NS
The bonding driver piggybacks on time stamps kept by the network stack
for the purpose of the netdev TX watchdog, and this is problematic
because it does not work with NETIF_F_LLTX devices.
It is hard to say why the driver looks at dev_trans_start() of the
slave->dev, considering that this is updated even by non-ARP/NS probes
sent by us, and even by traffic not sent by us at all (for example PTP
on physical slave devices). ARP monitoring in active-backup mode appears
to still work even if we track only the last TX time of actual ARP
probes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- Add support for 64 bits enum, to match types exposed by kernel.
- Introduce support for sleepable uprobes program.
- Introduce support for enum textual representation in libbpf.
- New helpers to implement synproxy with eBPF/XDP.
- Improve loop performances, inlining indirect calls when possible.
- Removed all the deprecated libbpf APIs.
- Implement new eBPF-based LSM flavor.
- Add type match support, which allow accurate queries to the eBPF
used types.
- A few TCP congetsion control framework usability improvements.
- Add new infrastructure to manipulate CT entries via eBPF programs.
- Allow for livepatch (KLP) and BPF trampolines to attach to the same
kernel function.
Protocols:
- Introduce per network namespace lookup tables for unix sockets,
increasing scalability and reducing contention.
- Preparation work for Wi-Fi 7 Multi-Link Operation (MLO) support.
- Add support to forciby close TIME_WAIT TCP sockets via user-space
tools.
- Significant performance improvement for the TLS 1.3 receive path,
both for zero-copy and not-zero-copy.
- Support for changing the initial MTPCP subflow priority/backup
status
- Introduce virtually contingus buffers for sockets over RDMA, to
cope better with memory pressure.
- Extend CAN ethtool support with timestamping capabilities
- Refactor CAN build infrastructure to allow building only the needed
features.
Driver API:
- Remove devlink mutex to allow parallel commands on multiple links.
- Add support for pause stats in distributed switch.
- Implement devlink helpers to query and flash line cards.
- New helper for phy mode to register conversion.
New hardware / drivers:
- Ethernet DSA driver for the rockchip mt7531 on BPI-R2 Pro.
- Ethernet DSA driver for the Renesas RZ/N1 A5PSW switch.
- Ethernet DSA driver for the Microchip LAN937x switch.
- Ethernet PHY driver for the Aquantia AQR113C EPHY.
- CAN driver for the OBD-II ELM327 interface.
- CAN driver for RZ/N1 SJA1000 CAN controller.
- Bluetooth: Infineon CYW55572 Wi-Fi plus Bluetooth combo device.
Drivers:
- Intel Ethernet NICs:
- i40e: add support for vlan pruning
- i40e: add support for XDP framented packets
- ice: improved vlan offload support
- ice: add support for PPPoE offload
- Mellanox Ethernet (mlx5)
- refactor packet steering offload for performance and scalability
- extend support for TC offload
- refactor devlink code to clean-up the locking schema
- support stacked vlans for bridge offloads
- use TLS objects pool to improve connection rate
- Netronome Ethernet NICs (nfp):
- extend support for IPv6 fields mangling offload
- add support for vepa mode in HW bridge
- better support for virtio data path acceleration (VDPA)
- enable TSO by default
- Microsoft vNIC driver (mana)
- add support for XDP redirect
- Others Ethernet drivers:
- bonding: add per-port priority support
- microchip lan743x: extend phy support
- Fungible funeth: support UDP segmentation offload and XDP xmit
- Solarflare EF100: add support for virtual function representors
- MediaTek SoC: add XDP support
- Mellanox Ethernet/IB switch (mlxsw):
- dropped support for unreleased H/W (XM router).
- improved stats accuracy
- unified bridge model coversion improving scalability (parts 1-6)
- support for PTP in Spectrum-2 asics
- Broadcom PHYs
- add PTP support for BCM54210E
- add support for the BCM53128 internal PHY
- Marvell Ethernet switches (prestera):
- implement support for multicast forwarding offload
- Embedded Ethernet switches:
- refactor OcteonTx MAC filter for better scalability
- improve TC H/W offload for the Felix driver
- refactor the Microchip ksz8 and ksz9477 drivers to share the
probe code (parts 1, 2), add support for phylink mac
configuration
- Other WiFi:
- Microchip wilc1000: diable WEP support and enable WPA3
- Atheros ath10k: encapsulation offload support
Old code removal:
- Neterion vxge ethernet driver: this is untouched since more than 10 years"
* tag 'net-next-6.0' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1890 commits)
doc: sfp-phylink: Fix a broken reference
wireguard: selftests: support UML
wireguard: allowedips: don't corrupt stack when detecting overflow
wireguard: selftests: update config fragments
wireguard: ratelimiter: use hrtimer in selftest
net/mlx5e: xsk: Discard unaligned XSK frames on striding RQ
net: usb: ax88179_178a: Bind only to vendor-specific interface
selftests: net: fix IOAM test skip return code
net: usb: make USB_RTL8153_ECM non user configurable
net: marvell: prestera: remove reduntant code
octeontx2-pf: Reduce minimum mtu size to 60
net: devlink: Fix missing mutex_unlock() call
net/tls: Remove redundant workqueue flush before destroy
net: txgbe: Fix an error handling path in txgbe_probe()
net: dsa: Fix spelling mistakes and cleanup code
Documentation: devlink: add add devlink-selftests to the table of contents
dccp: put dccp_qpolicy_full() and dccp_qpolicy_push() in the same lock
net: ionic: fix error check for vlan flags in ionic_set_nic_features()
net: ice: fix error NETIF_F_HW_VLAN_CTAG_FILTER check in ice_vsi_sync_fltr()
nfp: flower: add support for tunnel offload without key ID
...
Linus Torvalds [Wed, 3 Aug 2022 22:26:04 +0000 (15:26 -0700)]
Merge tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata
Pull ATA updates from Damien Le Moal:
- Some code refactoring for the pata_hpt37x and pata_hpt3x2n drivers,
from Sergei.
- Several patches to cleanup in libata-core, libata-scsi and libata-eh
code: fixes arguments and variables types, change some functions
declaration to static and fix for a typo in a comment. From Sergey
and Xiang.
- Fix a compilation warning in the pata_macio driver, from me.
- A fix for the expected number of resources in the sata_mv driver fix,
from Andrew.
* tag 'ata-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/libata:
ata: sata_mv: Fixes expected number of resources now IRQs are gone
ata: libata-scsi: fix result type of ata_ioc32()
ata: pata_macio: Fix compilation warning
ata: libata-eh: fix sloppy result type of ata_internal_cmd_timeout()
ata: libata-core: fix sloppy parameter type in ata_exec_internal[_sg]()
ata: make ata_port::fastdrain_cnt *unsigned int*
ata: libata-eh: fix sloppy result type of ata_eh_nr_in_flight()
ata: libata-core: make ata_exec_internal_sg() *static*
ata: make transfer mode masks *unsigned int*
ata: libata-core: get rid of *else* branches in ata_id_n_sectors()
ata: libata-core: fix sloppy typing in ata_id_n_sectors()
ata: pata_hpt3x2n: pass base DPLL frequency to hpt3x2n_pci_clock()
ata: pata_hpt37x: merge hpt374_read_freq() to hpt37x_pci_clock()
ata: pata_hpt37x: factor out hpt37x_pci_clock()
ata: pata_hpt37x: move claculating PCI clock from hpt37x_clock_slot()
ata: libata: Fix syntax errors in comments
Linus Torvalds [Wed, 3 Aug 2022 22:21:53 +0000 (15:21 -0700)]
Merge tag 'zonefs-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs
Pull zonefs update from Damien Le Moal:
"A single change for this cycle to simplify handling of the memory page
used as super block buffer during mount (from Fabio)"
* tag 'zonefs-5.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: Call page_address() on page acquired with GFP_KERNEL flag
Linus Torvalds [Wed, 3 Aug 2022 22:16:49 +0000 (15:16 -0700)]
Merge tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull iomap updates from Darrick Wong:
"The most notable change in this first batch is that we no longer
schedule pages beyond i_size for writeback, preferring instead to let
truncate deal with those pages.
Next week, there may be a second pull request to remove
iomap_writepage from the other two filesystems (gfs2/zonefs) that use
iomap for buffered IO. This follows in the same vein as the recent
removal of writepage from XFS, since it hasn't been triggered in a few
years; it does nothing during direct reclaim; and as far as the people
who examined the patchset can tell, it's moving the codebase in the
right direction.
However, as it was a late addition to for-next, I'm holding off on
that section for another week of testing to see if anyone can come up
with a solid reason for holding off in the meantime.
Summary:
- Skip writeback for pages that are completely beyond EOF
- Minor code cleanups"
* tag 'iomap-5.20-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
dax: set did_zero to true when zeroing successfully
iomap: set did_zero to true when zeroing successfully
iomap: skip pages past eof in iomap_do_writepage()
Linus Torvalds [Wed, 3 Aug 2022 21:54:52 +0000 (14:54 -0700)]
Merge tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"This brings some long awaited changes, the send protocol bump,
otherwise lots of small improvements and fixes. The main core part is
reworking bio handling, cleaning up the submission and endio and
improving error handling.
There are some changes outside of btrfs adding helpers or updating
API, listed at the end of the changelog.
Features:
- sysfs:
- export chunk size, in debug mode add tunable for setting its size
- show zoned among features (was only in debug mode)
- show commit stats (number, last/max/total duration)
- send protocol updated to 2
- new commands:
- ability write larger data chunks than 64K
- send raw compressed extents (uses the encoded data ioctls),
ie. no decompression on send side, no compression needed on
receive side if supported
- send 'otime' (inode creation time) among other timestamps
- send file attributes (a.k.a file flags and xflags)
- this is first version bump, backward compatibility on send and
receive side is provided
- there are still some known and wanted commands that will be
implemented in the near future, another version bump will be
needed, however we want to minimize that to avoid causing
usability issues
- print checksum type and implementation at mount time
- don't print some messages at mount (mentioned as people asked about
it), we want to print messages namely for new features so let's
make some space for that
- big metadata - this has been supported for a long time and is
not a feature that's worth mentioning
- skinny metadata - same reason, set by default by mkfs
Performance improvements:
- reduced amount of reserved metadata for delayed items
- when inserted items can be batched into one leaf
- when deleting batched directory index items
- when deleting delayed items used for deletion
- overall improved count of files/sec, decreased subvolume lock
contention
- metadata item access bounds checker micro-optimized, with a few
percent of improved runtime for metadata-heavy operations
- increase direct io limit for read to 256 sectors, improved
throughput by 3x on sample workload
Notable fixes:
- raid56
- reduce parity writes, skip sectors of stripe when there are no
data updates
- restore reading from on-disk data instead of using stripe cache,
this reduces chances to damage correct data due to RMW cycle
- refuse to replay log with unknown incompat read-only feature bit
set
- zoned
- fix page locking when COW fails in the middle of allocation
- improved tracking of active zones, ZNS drives may limit the
number and there are ENOSPC errors due to that limit and not
actual lack of space
- adjust maximum extent size for zone append so it does not cause
late ENOSPC due to underreservation
- mirror reading error messages show the mirror number
- don't fallback to buffered IO for NOWAIT direct IO writes, we don't
have the NOWAIT semantics for buffered io yet
- send, fix sending link commands for existing file paths when there
are deleted and created hardlinks for same files
- repair all mirrors for profiles with more than 1 copy (raid1c34)
- fix repair of compressed extents, unify where error detection and
repair happen
Core changes:
- bio completion cleanups
- don't double defer compression bios
- simplify endio workqueues
- add more data to btrfs_bio to avoid allocation for read requests
- rework bio error handling so it's same what block layer does,
the submission works and errors are consumed in endio
- when asynchronous bio offload fails fall back to synchronous
checksum calculation to avoid errors under writeback or memory
pressure
- super block log_root_transid deprecated (never used)
- mixed_backref and big_metadata sysfs feature files removed, they've
been default for sufficiently long time, there are no known users
and mixed_backref could be confused with mixed_groups
Non-btrfs changes, API updates:
- minor highmem API update to cover const arguments
* tag 'for-5.20-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (163 commits)
btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
btrfs: fix repair of compressed extents
btrfs: remove the start argument to check_data_csum and export
btrfs: pass a btrfs_bio to btrfs_repair_one_sector
btrfs: simplify the pending I/O counting in struct compressed_bio
btrfs: repair all known bad mirrors
btrfs: merge btrfs_dev_stat_print_on_error with its only caller
btrfs: join running log transaction when logging new name
btrfs: simplify error handling in btrfs_lookup_dentry
btrfs: send: always use the rbtree based inode ref management infrastructure
btrfs: send: fix sending link commands for existing file paths
btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
btrfs: zoned: wait until zone is finished when allocation didn't progress
btrfs: zoned: write out partially allocated region
btrfs: zoned: activate necessary block group
btrfs: zoned: activate metadata block group on flush_space
btrfs: zoned: disable metadata overcommit for zoned
btrfs: zoned: introduce space_info->active_total_bytes
btrfs: zoned: finish least available block group on data bg allocation
btrfs: let can_allocate_chunk return error
...
Linus Torvalds [Wed, 3 Aug 2022 21:41:36 +0000 (14:41 -0700)]
Merge tag 'efi-efivars-removal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull efivars sysfs interface removal from Ard Biesheuvel:
"Remove the obsolete 'efivars' sysfs based interface to the EFI
variable store, now that all users have moved to the efivarfs pseudo
file system, which was created ~10 years ago to address some
fundamental shortcomings in the sysfs based driver.
Move the 'business logic' related to which EFI variables are important
and may affect the boot flow from the efivars support layer into the
efivarfs pseudo file system, so it is no longer exposed to other parts
of the kernel"
* tag 'efi-efivars-removal-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi:
efi: vars: Move efivar caching layer into efivarfs
efi: vars: Switch to new wrapper layer
efi: vars: Remove deprecated 'efivars' sysfs interface
Linus Torvalds [Wed, 3 Aug 2022 21:38:02 +0000 (14:38 -0700)]
Merge tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi
Pull EFI updates from Ard Biesheuvel:
- Enable mirrored memory for arm64
- Fix up several abuses of the efivar API
- Refactor the efivar API in preparation for moving the 'business
logic' part of it into efivarfs
- Enable ACPI PRM on arm64
* tag 'efi-next-for-v5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: (24 commits)
ACPI: Move PRM config option under the main ACPI config
ACPI: Enable Platform Runtime Mechanism(PRM) support on ARM64
ACPI: PRM: Change handler_addr type to void pointer
efi: Simplify arch_efi_call_virt() macro
drivers: fix typo in firmware/efi/memmap.c
efi: vars: Drop __efivar_entry_iter() helper which is no longer used
efi: vars: Use locking version to iterate over efivars linked lists
efi: pstore: Omit efivars caching EFI varstore access layer
efi: vars: Add thin wrapper around EFI get/set variable interface
efi: vars: Don't drop lock in the middle of efivar_init()
pstore: Add priv field to pstore_record for backend specific use
Input: applespi - avoid efivars API and invoke EFI services directly
selftests/kexec: remove broken EFI_VARS secure boot fallback check
brcmfmac: Switch to appropriate helper to load EFI variable contents
iwlwifi: Switch to proper EFI variable store interface
media: atomisp_gmin_platform: stop abusing efivar API
efi: efibc: avoid efivar API for setting variables
efi: avoid efivars layer when loading SSDTs from variables
efi: Correct comment on efi_memmap_alloc
memblock: Disable mirror feature if kernelcore is not specified
...
Linus Torvalds [Wed, 3 Aug 2022 21:03:51 +0000 (14:03 -0700)]
Merge tag 'pull-work.9p' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull 9p iov_iter fix from Al Viro:
"net/9p abuses iov_iter primitives - it attempts to copy _from_ a
destination-only iov_iter when it handles Rerror arriving in reply to
zero-copy request. Not hard to fix, fortunately.
This is a prereq for the iov_iter_get_pages() work in the second part
of iov_iter series, ended up in a separate branch"
* tag 'pull-work.9p' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
9p: handling Rerror without copy_from_iter_full()
Linus Torvalds [Wed, 3 Aug 2022 20:59:15 +0000 (13:59 -0700)]
Merge tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull copy_to_iter_mc fix from Al Viro:
"Backportable fix for copy_to_iter_mc() - the second part of iov_iter
work will pretty much overwrite this, but would be much harder to
backport"
* tag 'pull-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fix short copy handling in copy_mc_pipe_to_iter()
Linus Torvalds [Wed, 3 Aug 2022 20:50:22 +0000 (13:50 -0700)]
Merge tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs iov_iter updates from Al Viro:
"Part 1 - isolated cleanups and optimizations.
One of the goals is to reduce the overhead of using ->read_iter() and
->write_iter() instead of ->read()/->write().
new_sync_{read,write}() has a surprising amount of overhead, in
particular inside iocb_flags(). That's the explanation for the
beginning of the series is in this pile; it's not directly
iov_iter-related, but it's a part of the same work..."
* tag 'pull-work.iov_iter-base' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
first_iovec_segment(): just return address
iov_iter: massage calling conventions for first_{iovec,bvec}_segment()
iov_iter: first_{iovec,bvec}_segment() - simplify a bit
iov_iter: lift dealing with maxpages out of first_{iovec,bvec}_segment()
iov_iter_get_pages{,_alloc}(): cap the maxsize with MAX_RW_COUNT
iov_iter_bvec_advance(): don't bother with bvec_iter
copy_page_{to,from}_iter(): switch iovec variants to generic
keep iocb_flags() result cached in struct file
iocb: delay evaluation of IS_SYNC(...) until we want to check IOCB_DSYNC
struct file: use anonymous union member for rcuhead and llist
btrfs: use IOMAP_DIO_NOSYNC
teach iomap_dio_rw() to suppress dsync
No need of likely/unlikely on calls of check_copy_size()
Linus Torvalds [Wed, 3 Aug 2022 18:43:12 +0000 (11:43 -0700)]
Merge tag 'pull-work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs dcache updates from Al Viro:
"The main part here is making parallel lookups safe for RT - making
sure preemption is disabled in start_dir_add()/ end_dir_add() sections
(on non-RT it's automatic, on RT it needs to to be done explicitly)
and moving wakeups from __d_lookup_done() inside of such to the end of
those sections.
Wakeups can be safely delayed for as long as ->d_lock on in-lookup
dentry is held; proving that has caught a bug in d_add_ci() that
allows memory corruption when sufficiently bogus ntfs (or
case-insensitive xfs) image is mounted. Easily fixed, fortunately"
* tag 'pull-work.dcache' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs/dcache: Move wakeup out of i_seq_dir write held region.
fs/dcache: Move the wakeup from __d_lookup_done() to the caller.
fs/dcache: Disable preemption on i_dir_seq write side on PREEMPT_RT
d_add_ci(): make sure we don't miss d_lookup_done()
Linus Torvalds [Wed, 3 Aug 2022 18:35:20 +0000 (11:35 -0700)]
Merge tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs lseek updates from Al Viro:
"Jason's lseek series.
Saner handling of 'lseek should fail with ESPIPE' - this gets rid of
the magical no_llseek thing and makes checks consistent.
In particular, the ad-hoc "can we do splice via internal pipe" checks
got saner (and somewhat more permissive, which is what Jason had been
after, AFAICT)"
* tag 'pull-work.lseek' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
fs: remove no_llseek
fs: check FMODE_LSEEK to control internal pipe splicing
vfio: do not set FMODE_LSEEK flag
dma-buf: remove useless FMODE_LSEEK flag
fs: do not compare against ->llseek
fs: clear or set FMODE_LSEEK based on llseek function
Linus Torvalds [Wed, 3 Aug 2022 18:31:42 +0000 (11:31 -0700)]
Merge tag 'pull-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs namei updates from Al Viro:
"RCU pathwalk cleanups.
Storing sampled ->d_seq of the next dentry in nameidata simplifies
life considerably, especially if we delay fetching ->d_inode until
step_into()"
* tag 'pull-work.namei' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
step_into(): move fetching ->d_inode past handle_mounts()
lookup_fast(): don't bother with inode
follow_dotdot{,_rcu}(): don't bother with inode
step_into(): lose inode argument
namei: stash the sampled ->d_seq into nameidata
namei: move clearing LOOKUP_RCU towards rcu_read_unlock()
switch try_to_unlazy_next() to __legitimize_mnt()
follow_dotdot{,_rcu}(): change calling conventions
namei: get rid of pointless unlikely(read_seqcount_retry(...))
__follow_mount_rcu(): verify that mount_lock remains unchanged
Linus Torvalds [Wed, 3 Aug 2022 17:35:43 +0000 (10:35 -0700)]
Merge tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache
Pull folio updates from Matthew Wilcox:
- Fix an accounting bug that made NR_FILE_DIRTY grow without limit
when running xfstests
- Convert more of mpage to use folios
- Remove add_to_page_cache() and add_to_page_cache_locked()
- Convert find_get_pages_range() to filemap_get_folios()
- Improvements to the read_cache_page() family of functions
- Remove a few unnecessary checks of PageError
- Some straightforward filesystem conversions to use folios
- Split PageMovable users out from address_space_operations into
their own movable_operations
- Convert aops->migratepage to aops->migrate_folio
- Remove nobh support (Christoph Hellwig)
* tag 'folio-6.0' of git://git.infradead.org/users/willy/pagecache: (78 commits)
fs: remove the NULL get_block case in mpage_writepages
fs: don't call ->writepage from __mpage_writepage
fs: remove the nobh helpers
jfs: stop using the nobh helper
ext2: remove nobh support
ntfs3: refactor ntfs_writepages
mm/folio-compat: Remove migration compatibility functions
fs: Remove aops->migratepage()
secretmem: Convert to migrate_folio
hugetlb: Convert to migrate_folio
aio: Convert to migrate_folio
f2fs: Convert to filemap_migrate_folio()
ubifs: Convert to filemap_migrate_folio()
btrfs: Convert btrfs_migratepage to migrate_folio
mm/migrate: Add filemap_migrate_folio()
mm/migrate: Convert migrate_page() to migrate_folio()
nfs: Convert to migrate_folio
btrfs: Convert btree_migratepage to migrate_folio
mm/migrate: Convert expected_page_refs() to folio_expected_refs()
mm/migrate: Convert buffer_migrate_page() to buffer_migrate_folio()
...
Linus Torvalds [Wed, 3 Aug 2022 16:45:08 +0000 (09:45 -0700)]
Merge tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
"Several core optimizations:
- threadgroup_rwsem write locking is skipped when configuring
controllers in empty subtrees.
Combined with CLONE_INTO_CGROUP, this allows the common static
usage pattern to not grab threadgroup_rwsem at all (glibc still
doesn't seem ready for CLONE_INTO_CGROUP unfortunately).
- threadgroup_rwsem used to be put into non-percpu mode by default
due to latency concerns in specific use cases. There's no reason
for everyone else to pay for it. Make the behavior optional.
- psi no longer allocates memory when disabled.
... along with some code cleanups"
* tag 'cgroup-for-5.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
cgroup: Skip subtree root in cgroup_update_dfl_csses()
cgroup: remove "no" prefixed mount options
cgroup: Make !percpu threadgroup_rwsem operations optional
cgroup: Add "no" prefixed mount options
cgroup: Elide write-locking threadgroup_rwsem when updating csses on an empty subtree
cgroup.c: remove redundant check for mixable cgroup in cgroup_migrate_vet_dst
cgroup.c: add helper __cset_cgroup_from_root to cleanup duplicated codes
psi: dont alloc memory for psi by default