Guangbin Huang [Mon, 29 Nov 2021 14:00:22 +0000 (22:00 +0800)]
net: hns3: refine function hclge_cfg_mac_speed_dup_hw()
To reuse the code of converting speed of driver to speed of firmware in
function hclge_cfg_mac_speed_dup_hw(), encapsulate them into a new
function hclge_convert_to_fw_speed().
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yufeng Mo [Mon, 29 Nov 2021 14:00:21 +0000 (22:00 +0800)]
net: hns3: split function hns3_get_tx_timeo_queue_info()
Function hns3_get_tx_timeo_queue_info() is a bit too long. So add two
new functions hns3_dump_queue_stats() and hns3_dump_queue_reg() to
simplify code and improve code readability.
Signed-off-by: Yufeng Mo <moyufeng@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jie Wang [Mon, 29 Nov 2021 14:00:20 +0000 (22:00 +0800)]
net: hns3: refactor two hns3 debugfs functions
Use for statement to optimize some print work of function
hclge_dbg_dump_rst_info() and hclge_dbg_dump_mac_enable_status() to
improve code simplicity.
Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hao Chen [Mon, 29 Nov 2021 14:00:19 +0000 (22:00 +0800)]
net: hns3: refactor hns3_nic_reuse_page()
Split rx copybreak handle into a separate function from function
hns3_nic_reuse_page() to improve code simplicity.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Currently, the hclge_reset_prepare_general function uses the goto
statement to jump upwards, which increases code complexity and makes
the program structure difficult to understand. In addition, if
reset_pending is set, retry_cnt cannot be increased. This may result
in a failure to exit the retry or increase the number of retries.
Use the while statement instead to make the program easier to understand
and solve the problem that the goto statement cannot be exited.
Signed-off-by: Jiaran Zhang <zhangjiaran@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Menglong Dong [Sun, 28 Nov 2021 06:01:02 +0000 (14:01 +0800)]
net: snmp: add statistics for tcp small queue check
Once tcp small queue check failed in tcp_small_queue_check(), the
throughput of tcp will be limited, and it's hard to distinguish
whether it is out of tcp congestion control.
Add statistics of LINUX_MIB_TCPSMALLQUEUEFAILURE for this scene.
Signed-off-by: Menglong Dong <imagedong@tencent.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Leon Romanovsky [Sun, 28 Nov 2021 12:14:46 +0000 (14:14 +0200)]
devlink: Remove misleading internal_flags from health reporter dump
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET command doesn't have .doit callback
and has no use in internal_flags at all. Remove this misleading assignment.
Fixes: 862f45f6e290 ("devlink: Hang reporter's dump method on a dumpit cb") Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 29 Nov 2021 13:02:01 +0000 (13:02 +0000)]
Merge branch 'seville-shared-mdio'
Colin Foster says:
====================
update seville to use shared MDIO driver
This patch set exposes and utilizes the shared MDIO bus in
drivers/net/mdio/msio-mscc-miim.c
v3:
* Fix errors using uninitilized "dev" inside the probe function.
* Remove phy_regmap from the setup function, since it currently
isn't used
* Remove GCB_PHY_PHY_CFG definition from ocelot.h - it isn't used
yet...
v2:
* Error handling (thanks Andrew Lunn)
* Fix logic errors calling mscc_miim_setup during patch 1/3 (thanks
Jakub Kicinski)
* Remove unnecessary felix_mdio file (thanks Vladimir Oltean)
* Pass NULL to mscc_miim_setup instead of GCB_PHY_PHY_CFG, since the
phy reset isn't handled at that point of the Seville driver (patch
3/3)
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Switch to a shared MDIO access implementation by way of the mdio-mscc-miim
driver.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Switch seville to use of_mdiobus_register(bus, NULL) instead of just
mdiobus_register. This code is about to be pulled into a separate module
that can optionally define ports by the device_node.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vladimir Oltean <vladimir.oltean@nxp.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Foster [Mon, 29 Nov 2021 01:57:35 +0000 (17:57 -0800)]
net: mdio: mscc-miim: convert to a regmap implementation
Utilize regmap instead of __iomem to perform indirect mdio access. This
will allow for custom regmaps to be used by way of the mscc_miim_setup
function.
Signed-off-by: Colin Foster <colin.foster@in-advantage.com> Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch series add support for Microchip lan966x driver
The lan966x switch is a multi-port Gigabit AVB/TSN Ethernet Switch with
two integrated 10/100/1000Base-T PHYs. In addition to the integrated PHYs,
it supports up to 2RGMII/RMII, up to 3BASE-X/SERDES/2.5GBASE-X and up to
2 Quad-SGMII/Quad-USGMII interfaces.
Initially it adds support only for the ports to behave as simple
NIC cards. In the future patches it would be extended with other
functionality like Switchdev, PTP, Frame DMA, VCAP, etc.
v4->v5:
- more fixes to the reset of the switch, require all resources before
activating the hardware
- fix to lan966x-switch binding
- implement get/set_pauseparam in ethtool_ops
- stop calling lan966x_port_link_down when calling lan966x_port_pcs_set and
call it in lan966x_phylink_mac_link_down
v3->v4:
- add timeouts when injecting/extracting frames, in case the HW breaks
- simplify the creation of the IFH
- fix the order of operations in lan966x_cleanup_ports
- fixes to phylink based on Russel review
v2->v3:
- fix compiling issues for x86
- fix resource management in first patch
v1->v2:
- add new patch for MAINTAINERS
- add functions lan966x_mac_cpu_learn/forget
- fix build issues with second patch
- fix the reset of the switch, return error if there is no reset controller
- start to use phylink_mii_c22_pcs_decode_state and
phylink_mii_c22_pcs_encode_advertisement to remove duplicate code
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Mon, 29 Nov 2021 12:43:58 +0000 (13:43 +0100)]
net: lan966x: add ethtool configuration and statistics
This patch adds support for statistics counters for the network
interfaces. Also adds support for configuring the network interface via
ethtool like: speed, duplex etc.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Horatiu Vultur [Mon, 29 Nov 2021 12:43:57 +0000 (13:43 +0100)]
net: lan966x: add mactable support
This patch adds support for MAC table operations like add and forget.
Also add the functionality to read the MAC address from DT, if there is
no MAC set in DT it would use a random one.
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Walleij [Mon, 22 Nov 2021 22:35:30 +0000 (23:35 +0100)]
net: ixp4xx_hss: Convert to use DT probing
IXP4xx is being migrated to device tree only. Convert this
driver to use device tree probing.
Pull in all the boardfile code from the one boardfile and
make it local, pull all the boardfile parameters from the
device tree instead of the board file.
Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Walleij [Mon, 22 Nov 2021 22:35:29 +0000 (23:35 +0100)]
dt-bindings: net: Add bindings for IXP4xx V.35 WAN HSS
This adds device tree bindings for the IXP4xx V.35 WAN high
speed serial (HSS) link.
An example is added to the NPE example where the HSS appears
as a child.
Cc: devicetree@vger.kernel.org Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Alvin Šipraga [Mon, 29 Nov 2021 10:30:19 +0000 (11:30 +0100)]
net: dsa: rtl8365mb: set RGMII RX delay in steps of 0.3 ns
A contact at Realtek has clarified what exactly the units of RGMII RX
delay are. The answer is that the unit of RX delay is "about 0.3 ns".
Take this into account when parsing rx-internal-delay-ps by
approximating the closest step value. Delays of more than 2.1 ns are
rejected.
This obviously contradicts the previous assumption in the driver that a
step value of 4 was "about 2 ns", but Realtek also points out that it is
easy to find more than one RX delay step value which makes RGMII work.
Fixes: 0d4087d7fdeb ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC") Cc: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Acked-by: Arınç ÜNAL <arinc.unal@arinc9.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Alvin Šipraga [Mon, 29 Nov 2021 10:30:18 +0000 (11:30 +0100)]
net: dsa: rtl8365mb: fix garbled comment
Fixes: 0d4087d7fdeb ("net: dsa: realtek-smi: add rtl8365mb subdriver for RTL8365MB-VC") Signed-off-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
Ivan Vecera [Mon, 29 Nov 2021 09:58:50 +0000 (10:58 +0100)]
selftests: net: bridge: fix typo in vlan_filtering dependency test
Prior patch:
]# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable [ OK ]
Device "bridge" does not exist.
TEST: Disable multicast vlan snooping when vlan filtering is disabled [FAIL]
Vlan filtering is disabled but multicast vlan snooping is still enabled
After patch:
# TESTS=vlmc_filtering_test ./bridge_vlan_mcast.sh
TEST: Vlan multicast snooping enable [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
Fixes: a55b5bb489f87c ("selftests: net: bridge: add test for vlan_filtering dependency") Cc: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Ivan Vecera <ivecera@redhat.com> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The mpls macros for_nexthops and change_nexthops were probably copied
from decnet or ipv4 but they grew a superfluous variable and lost a
beneficial "const".
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The BAM Data Multiplexer provides access to the network data channels
of modems integrated into many older Qualcomm SoCs, e.g. Qualcomm MSM8916
or MSM8974. This series adds a driver that allows using it.
All the changes in this patch series are based on a quite complicated
driver from Qualcomm [1]. The driver has been used in postmarketOS [2]
on various smartphones/tablets based on Qualcomm MSM8916 and MSM8974
for more than a year now with no reported problems. It works out of
the box with open-source WWAN userspace such as ModemManager.
Changes in v3:
- Clarify DT schema based on discussion
- Drop bam_dma/dmaengine patches since they already landed in 5.16
- Rebase on net-next
- Simplify cover letter and commit messages
Changes in v2:
- Rename "qcom,remote-power-collapse" -> "qcom,powered-remotely"
- Rebase on net-next and fix conflicts
- Rename network interfaces from "rmnet%d" -> "wwan%d"
- Fix wrong file name in MAINTAINERS entry
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The BAM Data Multiplexer provides access to the network data channels of
modems integrated into many older Qualcomm SoCs, e.g. Qualcomm MSM8916 or
MSM8974. It is built using a simple protocol layer on top of a DMA engine
(Qualcomm BAM) and bidirectional interrupts to coordinate power control.
The modem announces a fixed set of channels by sending an OPEN command.
The driver exports each channel as separate network interface so that
a connection can be established via QMI from userspace. The network
interface can work either in Ethernet or Raw-IP mode (configurable via
QMI). However, Ethernet mode seems to be broken with most firmwares
(network packets are actually received as Raw-IP), therefore the driver
only supports Raw-IP mode.
Note that the control channel (QMI/AT) is entirely separate from
BAM-DMUX and is already supported by the RPMSG_WWAN_CTRL driver.
The driver uses runtime PM to coordinate power control with the modem.
TX/RX buffers are put in a kind of "ring queue" and submitted via
the bam_dma driver of the DMAEngine subsystem.
Note that some newer firmware versions support QMAP ("rmnet" driver)
as additional multiplexing layer on top of BAM-DMUX, but this is not
currently supported by this driver.
Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Stephan Gerhold [Sat, 27 Nov 2021 17:31:07 +0000 (18:31 +0100)]
dt-bindings: net: Add schema for Qualcomm BAM-DMUX
The BAM Data Multiplexer provides access to the network data channels of
modems integrated into many older Qualcomm SoCs, e.g. Qualcomm MSM8916 or
MSM8974. It is built using a simple protocol layer on top of a DMA engine
(Qualcomm BAM) and bidirectional interrupts to coordinate power control.
The device tree node combines the incoming interrupt with the outgoing
interrupts (smem-states) as well as the two DMA channels, which allows
the BAM-DMUX driver to request all necessary resources.
Signed-off-by: Stephan Gerhold <stephan@gerhold.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Hao Chen [Sat, 27 Nov 2021 09:34:05 +0000 (17:34 +0800)]
net: hns3: use macro IANA_VXLAN_GPE_UDP_PORT to replace number 4790
This patch uses macro IANA_VXLAN_GPE_UDP_PORT to replace number 4790 for
cleanup.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Hao Chen [Sat, 27 Nov 2021 09:34:04 +0000 (17:34 +0800)]
net: vxlan: add macro definition for number of IANA VXLAN-GPE port
Add macro definition for number of IANA VXLAN-GPE port for generic use.
Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: Write lock dev_base_lock without disabling bottom halves.
The writer acquires dev_base_lock with disabled bottom halves.
The reader can acquire dev_base_lock without disabling bottom halves
because there is no writer in softirq context.
On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves (as in write_lock_bh()) and somewhere else with
enabled bottom halves (as by read_lock() in netstat_show()) followed by
disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
-> spin_lock_bh()). This is the reverse locking order.
All read_lock() invocation are from sysfs callback which are not invoked
from softirq context. Therefore there is no need to disable bottom
halves while acquiring a write lock.
Acquire the write lock of dev_base_lock without disabling bottom halves.
Reported-by: Pei Zhang <pezhang@redhat.com> Reported-by: Luis Claudio R. Goncalves <lgoncalv@redhat.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Tom Parkin [Fri, 26 Nov 2021 16:09:03 +0000 (16:09 +0000)]
net/l2tp: convert tunnel rwlock_t to rcu
Previously commit 63edf8f6b927 ("l2tp: Convert rwlock to RCU") converted
most, but not all, rwlock instances in the l2tp subsystem to RCU.
The remaining rwlock protects the per-tunnel hashlist of sessions which
is used for session lookups in the UDP-encap data path.
Convert the remaining rwlock to rcu to improve performance of UDP-encap
tunnels.
Note that the tunnel and session, which both live on RCU-protected
lists, use slightly different approaches to incrementing their refcounts
in the various getter functions.
The tunnel has to use refcount_inc_not_zero because the tunnel shutdown
process involves dropping the refcount to zero prior to synchronizing
RCU readers (via. kfree_rcu).
By contrast, the session shutdown removes the session from the list(s)
it is on, synchronizes with readers, and then decrements the session
refcount. Since the getter functions increment the session refcount
with the RCU read lock held we prevent getters seeing a zero session
refcount, and therefore don't need to use refcount_inc_not_zero.
Signed-off-by: Tom Parkin <tparkin@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 29 Nov 2021 12:05:52 +0000 (12:05 +0000)]
Merge branch 'mvneta-next'
Maxime Chevallier says:
====================
net: mvneta: mqprio cleanups and shaping support
This is the second version of the series that adds some improvements to the
existing mqprio implementation in mvneta, and adds support for
egress shaping offload.
The first 3 patches are some minor cleanups, such as using the
tc_mqprio_qopt_offload structure to get access to more offloading
options, cleaning the logic to detect whether or not we should offload
mqprio setting, and allowing to have a 1 to N mapping between TCs and
queues.
The last patch adds traffic shaping offload, using mvneta's per-queue
token buckets, allowing to limit rates from 10Kbps up to 5Gbps with
10Kbps increments.
This was tested only on an Armada 3720, with traffic up to 2.5Gbps.
Changes since V1 fixes the build for 32bits kernels, using the right
div helpers as suggested by Jakub.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The mvneta controller is able to do some tocken-bucket per-queue traffic
shaping. This commit adds support for setting these using the TC mqprio
interface.
The token-bucket parameters are customisable, but the current
implementation configures them to have a 10kbps resolution for the
rate limitation, since it allows to cover the whole range of max_rate
values from 10kbps to 5Gbps with 10kbps increments.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvneta: Allow having more than one queue per TC
The current mqprio implementation assumed that we are only using one
queue per TC. Use the offset and count parameters to allow using
multiple queues per TC. In that case, the controller will use a standard
round-robin algorithm to pick queues assigned to the same TC, with the
same priority.
This only applies to VLAN priorities in ingress traffic, each TC
corresponding to a vlan priority.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The qopt->hw flag is set by the TC code according to the offloading mode
asked by user. Don't force-set it in the driver, but instead read it to
make sure we do what's asked.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvneta: Use struct tc_mqprio_qopt_offload for MQPrio configuration
The struct tc_mqprio_qopt_offload is a container for struct tc_mqprio_qopt,
that allows passing extra parameters, such as traffic shaping. This commit
converts the current mqprio code to that new struct.
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
af_unix: Replace unix_table_lock with per-hash locks.
The hash table of AF_UNIX sockets is protected by a single big lock,
unix_table_lock. This series replaces it with small per-hash locks.
1st - 2nd : Misc refactoring
3rd - 8th : Separate BSD/abstract address logics
9th - 11th : Prep to save a hash in each socket
12th : Replace the big lock
13th : Speed up autobind()
Note to maintainers:
The 12th patch adds two kinds of Sparse warnings on patchwork:
about unix_table_double_lock/unlock()
We can avoid this by adding two apparent acquires/releases annotations,
but there are the same kinds of warnings about unix_state_double_lock().
about unix_next_socket() and unix_seq_stop() (/proc/net/unix)
This is because Sparse does not understand logic in unix_next_socket(),
which leaves a spin lock held until it returns NULL.
Also, tcp_seq_stop() causes a warning for the same reason.
These warnings seem reasonable, but let me know if there is any better way.
Please see [0] for details.
When we bind an AF_UNIX socket without a name specified, the kernel selects
an available one from 0x00000 to 0xFFFFF. unix_autobind() starts searching
from a number in the 'static' variable and increments it after acquiring
two locks.
If multiple processes try autobind, they obtain the same lock and check if
a socket in the hash list has the same name. If not, one process uses it,
and all except one end up retrying the _next_ number (actually not, it may
be incremented by the other processes). The more we autobind sockets in
parallel, the longer the latency gets. We can avoid such a race by
searching for a name from a random number.
These show latency in unix_autobind() while 64 CPUs are simultaneously
autobind-ing 1024 sockets for each.
The hash table of AF_UNIX sockets is protected by the single lock. This
patch replaces it with per-hash locks.
The effect is noticeable when we handle multiple sockets simultaneously.
Here is a test result on an EC2 c5.24xlarge instance. It shows latency
(under 10us only) in unix_insert_unbound_socket() while 64 CPUs creating
1024 sockets for each in parallel.
To replace unix_table_lock with per-hash locks in the next patch, we need
to save a hash in each socket because /proc/net/unix or BPF prog iterate
sockets while holding a hash table lock and release it later in a different
function.
Currently, we store a real/pseudo hash in struct unix_address. However, we
do not allocate it to unbound sockets, nor should we do just for that. For
this purpose, we can use sk_hash. Then, we no longer use the hash field in
struct unix_address and can remove it.
Also, this patch does
- rename unix_insert_socket() to unix_insert_unbound_socket()
- remove the redundant list argument from __unix_insert_socket() and
unix_insert_unbound_socket()
- use 'unsigned int' instead of 'unsigned' in __unix_set_addr_hash()
- remove 'inline' from unix_remove_socket() and
unix_insert_unbound_socket().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
af_unix: Remove UNIX_ABSTRACT() macro and test sun_path[0] instead.
In BSD and abstract address cases, we store sockets in the hash table with
keys between 0 and UNIX_HASH_SIZE - 1. However, the hash saved in a socket
varies depending on its address type; sockets with BSD addresses always
have UNIX_HASH_SIZE in their unix_sk(sk)->addr->hash.
This is just for the UNIX_ABSTRACT() macro used to check the address type.
The difference of the saved hashes comes from the first byte of the address
in the first place. So, we can test it directly.
Then we can keep a real hash in each socket and replace unix_table_lock
with per-hash locks in the later patch.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
af_unix: Allocate unix_address in unix_bind_(bsd|abstract)().
To terminate address with '\0' in unix_bind_bsd(), we add
unix_create_addr() and call it in unix_bind_bsd() and unix_bind_abstract().
Also, unix_bind_abstract() does not return -EEXIST. Only
kern_path_create() and vfs_mknod() in unix_bind_bsd() can return it,
so we move the last error check in unix_bind() to unix_bind_bsd().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
This patch removes unix_mkname() and postpones calculating a hash to
unix_bind_abstract(). Some BSD stuffs still remain in unix_bind()
though, the next patch packs them into unix_bind_bsd().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
af_unix: Cut unix_validate_addr() out of unix_mkname().
unix_mkname() tests socket address length and family and does some
processing based on the address type. It is called in the early stage,
and therefore some instructions are redundant and can end up in vain.
The address length/family tests are done twice in unix_bind(). Also, the
address type is rechecked later in unix_bind() and unix_find_other(), where
we can do the same processing. Moreover, in the BSD address case, the hash
is set to 0 but never used and confusing.
This patch moves the address tests out of unix_mkname(), and the following
patches move the other part into appropriate places and remove
unix_mkname() finally.
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
af_unix: Factorise unix_find_other() based on address types.
As done in the commit 4e5693846822 ("unix_bind(): take BSD and abstract
address cases into new helpers"), this patch moves BSD and abstract address
cases from unix_find_other() into unix_find_bsd() and unix_find_abstract().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We do not use struct socket in unix_autobind() and pass struct sock to
unix_bind_bsd() and unix_bind_abstract(). Let's pass it to unix_autobind()
as well.
Also, this patch fixes these errors by checkpatch.pl.
ERROR: do not use assignment in if condition
#1795: FILE: net/unix/af_unix.c:1795:
+ if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
CHECK: Logical continuations should be on the previous line
#1796: FILE: net/unix/af_unix.c:1796:
+ if (test_bit(SOCK_PASSCRED, &sock->flags) && !u->addr
+ && (err = unix_autobind(sock)) != 0)
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
The length of the AF_UNIX socket address contains an offset to the member
sun_path of struct sockaddr_un.
Currently, the preceding member is just sun_family, and its type is
sa_family_t and resolved to short. Therefore, the offset is represented by
sizeof(short). However, it is not clear and fragile to changes in struct
sockaddr_storage or sockaddr_un.
This commit makes it clear and robust by rewriting sizeof() with
offsetof().
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Wed, 24 Nov 2021 19:39:33 +0000 (14:39 -0500)]
bridge: use __set_bit in __br_vlan_set_default_pvid
The same optimization as the one in commit 47d95f289c07 ("net:
bridge: Slightly optimize 'find_portno()'") is needed for the
'changed' bitmap in __br_vlan_set_default_pvid().
This patch-set adds selftests for the new vlan multicast options that
were recently added. Most of the tests check for default values,
changing options and try to verify that the changes actually take
effect. The last test checks if the dependency between vlan_filtering
and mcast_vlan_snooping holds. The rest are pretty self-explanatory.
TEST: Vlan multicast snooping enable [ OK ]
TEST: Vlan global options existence [ OK ]
TEST: Vlan mcast_snooping global option default value [ OK ]
TEST: Vlan 10 multicast snooping control [ OK ]
TEST: Vlan mcast_querier global option default value [ OK ]
TEST: Vlan 10 multicast querier enable [ OK ]
TEST: Vlan 10 tagged IGMPv2 general query sent [ OK ]
TEST: Vlan 10 tagged MLD general query sent [ OK ]
TEST: Vlan mcast_igmp_version global option default value [ OK ]
TEST: Vlan mcast_mld_version global option default value [ OK ]
TEST: Vlan 10 mcast_igmp_version option changed to 3 [ OK ]
TEST: Vlan 10 tagged IGMPv3 general query sent [ OK ]
TEST: Vlan 10 mcast_mld_version option changed to 2 [ OK ]
TEST: Vlan 10 tagged MLDv2 general query sent [ OK ]
TEST: Vlan mcast_last_member_count global option default value [ OK ]
TEST: Vlan mcast_last_member_interval global option default value [ OK ]
TEST: Vlan 10 mcast_last_member_count option changed to 3 [ OK ]
TEST: Vlan 10 mcast_last_member_interval option changed to 200 [ OK ]
TEST: Vlan mcast_startup_query_interval global option default value [ OK ]
TEST: Vlan mcast_startup_query_count global option default value [ OK ]
TEST: Vlan 10 mcast_startup_query_interval option changed to 100 [ OK ]
TEST: Vlan 10 mcast_startup_query_count option changed to 3 [ OK ]
TEST: Vlan mcast_membership_interval global option default value [ OK ]
TEST: Vlan 10 mcast_membership_interval option changed to 200 [ OK ]
TEST: Vlan 10 mcast_membership_interval mdb entry expire [ OK ]
TEST: Vlan mcast_querier_interval global option default value [ OK ]
TEST: Vlan 10 mcast_querier_interval option changed to 100 [ OK ]
TEST: Vlan 10 mcast_querier_interval expire after outside query [ OK ]
TEST: Vlan mcast_query_interval global option default value [ OK ]
TEST: Vlan 10 mcast_query_interval option changed to 200 [ OK ]
TEST: Vlan mcast_query_response_interval global option default value [ OK ]
TEST: Vlan 10 mcast_query_response_interval option changed to 200 [ OK ]
TEST: Port vlan 10 option mcast_router default value [ OK ]
TEST: Port vlan 10 mcast_router option changed to 2 [ OK ]
TEST: Flood unknown vlan multicast packets to router port only [ OK ]
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
====================
selftests: net: bridge: add test for vlan_filtering dependency
Add a test for dependency of mcast_vlan_snooping on vlan_filtering. If
vlan_filtering gets disabled, then mcast_vlan_snooping must be
automatically disabled as well.
TEST: Disable multicast vlan snooping when vlan filtering is disabled [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add tests for the new per-port/vlan mcast_router option, verify that
unknown multicast packets are flooded only to router ports.
TEST: Port vlan 10 option mcast_router default value [ OK ]
TEST: Port vlan 10 mcast_router option changed to 2 [ OK ]
TEST: Flood unknown vlan multicast packets to router port only [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add tests which change the new per-vlan mcast_query_interval and verify
the new value is in effect, also add a test to change
mcast_query_response_interval's value.
TEST: Vlan mcast_query_interval global option default value [ OK ]
TEST: Vlan 10 mcast_query_interval option changed to 200 [ OK ]
TEST: Vlan mcast_query_response_interval global option default value [ OK ]
TEST: Vlan 10 mcast_query_response_interval option changed to 200 [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add tests which change the new per-vlan mcast_querier_interval and
verify that it is taken into account when an outside querier is present.
TEST: Vlan mcast_querier_interval global option default value [ OK ]
TEST: Vlan 10 mcast_querier_interval option changed to 100 [ OK ]
TEST: Vlan 10 mcast_querier_interval expire after outside query [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add tests which change the new per-vlan startup query count/interval
options and verify the proper number of queries are sent in the expected
interval.
TEST: Vlan mcast_startup_query_interval global option default value [ OK ]
TEST: Vlan mcast_startup_query_count global option default value [ OK ]
TEST: Vlan 10 mcast_startup_query_interval option changed to 100 [ OK ]
TEST: Vlan 10 mcast_startup_query_count option changed to 3 [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Add tests which verify the default values of mcast_last_member_count
mcast_last_member_count and also try to change them.
TEST: Vlan mcast_last_member_count global option default value [ OK ]
TEST: Vlan mcast_last_member_interval global option default value [ OK ]
TEST: Vlan 10 mcast_last_member_count option changed to 3 [ OK ]
TEST: Vlan 10 mcast_last_member_interval option changed to 200 [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selftests: net: bridge: add vlan mcast igmp/mld version tests
Add tests which change the new per-vlan IGMP/MLD versions and verify
that proper tagged general query packets are sent.
TEST: Vlan mcast_igmp_version global option default value [ OK ]
TEST: Vlan mcast_mld_version global option default value [ OK ]
TEST: Vlan 10 mcast_igmp_version option changed to 3 [ OK ]
TEST: Vlan 10 tagged IGMPv3 general query sent [ OK ]
TEST: Vlan 10 mcast_mld_version option changed to 2 [ OK ]
TEST: Vlan 10 tagged MLDv2 general query sent [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selftests: net: bridge: add vlan mcast querier test
Add a test to try the new global vlan mcast_querier control and also
verify that tagged general query packets are properly generated when
querier is enabled for a single vlan.
TEST: Vlan mcast_querier global option default value [ OK ]
TEST: Vlan 10 multicast querier enable [ OK ]
TEST: Vlan 10 tagged IGMPv2 general query sent [ OK ]
TEST: Vlan 10 tagged MLD general query sent [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
selftests: net: bridge: add vlan mcast snooping control test
Add the first test for bridge per-vlan multicast snooping which checks
if control of the global and per-vlan options work as expected, joins
and leaves are tested at each option value.
TEST: Vlan multicast snooping enable [ OK ]
TEST: Vlan global options existence [ OK ]
TEST: Vlan mcast_snooping global option default value [ OK ]
TEST: Vlan 10 multicast snooping control [ OK ]
Signed-off-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
- netfilter: ctnetlink: fix error codes and flags used for kernel
side filtering of dumps
- netfilter: flowtable: fix IPv6 tunnel addr match
- ncsi: align payload to 32-bit to fix dropped packets
- iavf: fix deadlock and loss of config during VF interface reset
- ice: avoid bpf_prog refcount underflow
- ocelot: fix broken PTP over IP and PTP API violations
Misc:
- marvell: mvpp2: increase MTU limit when XDP enabled"
* tag 'net-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits)
net: dsa: microchip: implement multi-bridge support
net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
net: mscc: ocelot: set up traps for PTP packets
net: ptp: add a definition for the UDP port for IEEE 1588 general messages
net: mscc: ocelot: create a function that replaces an existing VCAP filter
net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
net: hns3: fix incorrect components info of ethtool --reset command
net: hns3: fix one incorrect value of page pool info when queried by debugfs
net: hns3: add check NULL address for page pool
net: hns3: fix VF RSS failed problem after PF enable multi-TCs
net: qed: fix the array may be out of bound
net/smc: Don't call clcsock shutdown twice when smc shutdown
net: vlan: fix underflow for the real_dev refcnt
ptp: fix filter names in the documentation
ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
nfc: virtual_ncidev: change default device permissions
net/sched: sch_ets: don't peek at classes beyond 'nbands'
net: stmmac: Disable Tx queues when reconfiguring the interface
selftests: tls: test for correct proto_ops
tls: fix replacing proto_ops
...
Oleksij Rempel [Fri, 26 Nov 2021 12:39:26 +0000 (13:39 +0100)]
net: dsa: microchip: implement multi-bridge support
Current driver version is able to handle only one bridge at time.
Configuring two bridges on two different ports would end up shorting this
bridges by HW. To reproduce it:
ip l a name br0 type bridge
ip l a name br1 type bridge
ip l s dev br0 up
ip l s dev br1 up
ip l s lan1 master br0
ip l s dev lan1 up
ip l s lan2 master br1
ip l s dev lan2 up
Ping on lan1 and get response on lan2, which should not happen.
This happened, because current driver version is storing one global "Port VLAN
Membership" and applying it to all ports which are members of any
bridge.
To solve this issue, we need to handle each port separately.
This patch is dropping the global port member storage and calculating
membership dynamically depending on STP state and bridge participation.
Note: STP support was broken before this patch and should be fixed
separately.
Fixes: 8135f56c18a2 ("net: dsa: microchip: break KSZ9477 DSA driver into two files") Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20211126123926.2981028-1-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Fri, 26 Nov 2021 20:19:13 +0000 (12:19 -0800)]
Merge tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull ACPI fixes from Rafael Wysocki:
"These fix a NULL pointer dereference in the CPPC library code and a
locking issue related to printing the names of ACPI device nodes in
the device properties framework.
Specifics:
- Fix NULL pointer dereference in the CPPC library code occuring on
hybrid systems without CPPC support (Rafael Wysocki).
- Avoid attempts to acquire a semaphore with interrupts off when
printing the names of ACPI device nodes and clean up code on top of
that fix (Sakari Ailus)"
* tag 'acpi-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
ACPI: CPPC: Add NULL pointer check to cppc_get_perf()
ACPI: Make acpi_node_get_parent() local
ACPI: Get acpi_device's parent from the parent field
Linus Torvalds [Fri, 26 Nov 2021 20:14:50 +0000 (12:14 -0800)]
Merge tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
Pull power management fixes from Rafael Wysocki:
"These address three issues in the intel_pstate driver and fix two
problems related to hibernation.
Specifics:
- Make intel_pstate work correctly on Ice Lake server systems with
out-of-band performance control enabled (Adamos Ttofari).
- Fix EPP handling in intel_pstate during CPU offline and online in
the active mode (Rafael Wysocki).
- Make intel_pstate support ITMT on asymmetric systems with
overclocking enabled (Srinivas Pandruvada).
- Fix hibernation image saving when using the user space interface
based on the snapshot special device file (Evan Green).
- Make the hibernation code release the snapshot block device using
the same mode that was used when acquiring it (Thomas Zeitlhofer)"
* tag 'pm-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
PM: hibernate: Fix snapshot partial write lengths
PM: hibernate: use correct mode for swsusp_close()
cpufreq: intel_pstate: ITMT support for overclocked system
cpufreq: intel_pstate: Fix active mode offline/online EPP handling
cpufreq: intel_pstate: Add Ice Lake server to out-of-band IDs
Linus Torvalds [Fri, 26 Nov 2021 20:01:31 +0000 (12:01 -0800)]
Merge tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
Pull fuse fix from Miklos Szeredi:
"Fix a regression caused by a bugfix in the previous release. The
symptom is a VM_BUG_ON triggered from splice to the fuse device.
Unfortunately the original bugfix was already backported to a number
of stable releases, so this fix-fix will need to be backported as
well"
* tag 'fuse-fixes-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
fuse: release pipe buf after last use
====================
Fix broken PTP over IP on Ocelot switches
Changes in v2: added patch 5, added Richard's ack for the whole series
sans patch 5 which is new.
Po Liu reported recently that timestamping PTP over IPv4 is broken using
the felix driver on NXP LS1028A. This has been known for a while, of
course, since it has always been broken. The reason is because IP PTP
packets are currently treated as unknown IP multicast, which is not
flooded to the CPU port in the ocelot driver design, so packets don't
reach the ptp4l program.
The series solves the problem by installing packet traps per port when
the timestamping ioctl is called, depending on the RX filter selected
(L2, L4 or both).
====================
Vladimir Oltean [Fri, 26 Nov 2021 17:28:45 +0000 (19:28 +0200)]
net: mscc: ocelot: correctly report the timestamping RX filters in ethtool
The driver doesn't support RX timestamping for non-PTP packets, but it
declares that it does. Restrict the reported RX filters to PTP v2 over
L2 and over L4.
Fixes: c9a4992060bc ("net: mscc: PTP Hardware Clock (PHC) support") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 26 Nov 2021 17:28:44 +0000 (19:28 +0200)]
net: mscc: ocelot: set up traps for PTP packets
IEEE 1588 support was declared too soon for the Ocelot switch. Out of
reset, this switch does not apply any special treatment for PTP packets,
i.e. when an event message is received, the natural tendency is to
forward it by MAC DA/VLAN ID. This poses a problem when the ingress port
is under a bridge, since user space application stacks (written
primarily for endpoint ports, not switches) like ptp4l expect that PTP
messages are always received on AF_PACKET / AF_INET sockets (depending
on the PTP transport being used), and never being autonomously
forwarded. Any forwarding, if necessary (for example in Transparent
Clock mode) is handled in software by ptp4l. Having the hardware forward
these packets too will cause duplicates which will confuse endpoints
connected to these switches.
So PTP over L2 barely works, in the sense that PTP packets reach the CPU
port, but they reach it via flooding, and therefore reach lots of other
unwanted destinations too. But PTP over IPv4/IPv6 does not work at all.
This is because the Ocelot switch have a separate destination port mask
for unknown IP multicast (which PTP over IP is) flooding compared to
unknown non-IP multicast (which PTP over L2 is) flooding. Specifically,
the driver allows the CPU port to be in the PGID_MC port group, but not
in PGID_MCIPV4 and PGID_MCIPV6. There are several presentations from
Allan Nielsen which explain that the embedded MIPS CPU on Ocelot
switches is not very powerful at all, so every penny they could save by
not allowing flooding to the CPU port module matters. Unknown IP
multicast did not make it.
The de facto consensus is that when a switch is PTP-aware and an
application stack for PTP is running, switches should have some sort of
trapping mechanism for PTP packets, to extract them from the hardware
data path. This avoids both problems:
(a) PTP packets are no longer flooded to unwanted destinations
(b) PTP over IP packets are no longer denied from reaching the CPU since
they arrive there via a trap and not via flooding
It is not the first time when this change is attempted. Last time, the
feedback from Allan Nielsen and Andrew Lunn was that the traps should
not be installed by default, and that PTP-unaware switching may be
desired for some use cases:
https://patchwork.ozlabs.org/project/netdev/patch/20190813025214.18601-5-yangbo.lu@nxp.com/
To address that feedback, the present patch adds the necessary packet
traps according to the RX filter configuration transmitted by user space
through the SIOCSHWTSTAMP ioctl. Trapping is done via VCAP IS2, where we
keep 5 filters, which are amended each time RX timestamping is enabled
or disabled on a port:
- 1 for PTP over L2
- 2 for PTP over IPv4 (UDP ports 319 and 320)
- 2 for PTP over IPv6 (UDP ports 319 and 320)
The cookie by which these filters (invisible to tc) are identified is
strategically chosen such that it does not collide with the filters used
for the ocelot-8021q tagging protocol by the Felix driver, or with the
MRP traps set up by the Ocelot library.
Other alternatives were considered, like patching user space to do
something, but there are so many ways in which PTP packets could be made
to reach the CPU, generically speaking, that "do what?" is a very valid
question. The ptp4l program from the linuxptp stack already attempts to
do something: it calls setsockopt(IP_ADD_MEMBERSHIP) (and
PACKET_ADD_MEMBERSHIP, respectively) which translates in both cases into
a dev_mc_add() on the interface, in the kernel:
https://github.com/richardcochran/linuxptp/blob/v3.1.1/udp.c#L73
https://github.com/richardcochran/linuxptp/blob/v3.1.1/raw.c
Reality shows that this is not sufficient in case the interface belongs
to a switchdev driver, as dev_mc_add() does not show the intention to
trap a packet to the CPU, but rather the intention to not drop it (it is
strictly for RX filtering, same as promiscuous does not mean to send all
traffic to the CPU, but to not drop traffic with unknown MAC DA). This
topic is a can of worms in itself, and it would be great if user space
could just stay out of it.
On the other hand, setting up PTP traps privately within the driver is
not new by any stretch of the imagination:
https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum_ptp.c#L833
https://elixir.bootlin.com/linux/v5.16-rc2/source/drivers/net/dsa/hirschmann/hellcreek.c#L1050
https://elixir.bootlin.com/linux/v5.16-rc2/source/include/linux/dsa/sja1105.h#L21
So this is the approach taken here as well. The difference here being
that we prepare and destroy the traps per port, dynamically at runtime,
as opposed to driver init time, because apparently, PTP-unaware
forwarding is a use case.
Fixes: c9a4992060bc ("net: mscc: PTP Hardware Clock (PHC) support") Reported-by: Po Liu <po.liu@nxp.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 26 Nov 2021 17:28:43 +0000 (19:28 +0200)]
net: ptp: add a definition for the UDP port for IEEE 1588 general messages
As opposed to event messages (Sync, PdelayReq etc) which require
timestamping, general messages (Announce, FollowUp etc) do not.
In PTP they are part of different streams of data.
IEEE 1588-2008 Annex D.2 "UDP port numbers" states that the UDP
destination port assigned by IANA is 319 for event messages, and 320 for
general messages. Yet the kernel seems to be missing the definition for
general messages. This patch adds it.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 26 Nov 2021 17:28:42 +0000 (19:28 +0200)]
net: mscc: ocelot: create a function that replaces an existing VCAP filter
VCAP (Versatile Content Aware Processor) is the TCAM-based engine behind
tc flower offload on ocelot, among other things. The ingress port mask
on which VCAP rules match is present as a bit field in the actual key of
the rule. This means that it is possible for a rule to be shared among
multiple source ports. When the rule is added one by one on each desired
port, that the ingress port mask of the key must be edited and rewritten
to hardware.
But the API in ocelot_vcap.c does not allow for this. For one thing,
ocelot_vcap_filter_add() and ocelot_vcap_filter_del() are not symmetric,
because ocelot_vcap_filter_add() works with a preallocated and
prepopulated filter and programs it to hardware, and
ocelot_vcap_filter_del() does both the job of removing the specified
filter from hardware, as well as kfreeing it. That is to say, the only
option of editing a filter in place, which is to delete it, modify the
structure and add it back, does not work because it results in
use-after-free.
This patch introduces ocelot_vcap_filter_replace, which trivially
reprograms a VCAP entry to hardware, at the exact same index at which it
existed before, without modifying any list or allocating any memory.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 26 Nov 2021 17:28:41 +0000 (19:28 +0200)]
net: mscc: ocelot: don't downgrade timestamping RX filters in SIOCSHWTSTAMP
The ocelot driver, when asked to timestamp all receiving packets, 1588
v1 or NTP, says "nah, here's 1588 v2 for you".
According to this discussion:
https://patchwork.kernel.org/project/netdevbpf/patch/20211104133204.19757-8-martin.kaistra@linutronix.de/#24577647
drivers that downgrade from a wider request to a narrower response (or
even a response where the intersection with the request is empty) are
buggy, and should return -ERANGE instead. This patch fixes that.
Fixes: c9a4992060bc ("net: mscc: PTP Hardware Clock (PHC) support") Suggested-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jie Wang [Fri, 26 Nov 2021 12:03:18 +0000 (20:03 +0800)]
net: hns3: fix incorrect components info of ethtool --reset command
Currently, HNS3 driver doesn't clear the reset flags of components after
successfully executing reset, it causes userspace info of
"Components reset" and "Components not reset" is incorrect.
So fix this problem by clear corresponding reset flag after reset process.
Fixes: 71efd7b1ab84 ("net: hns3: add support for triggering reset by ethtool") Signed-off-by: Jie Wang <wangjie125@huawei.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Hao Chen [Fri, 26 Nov 2021 12:03:17 +0000 (20:03 +0800)]
net: hns3: fix one incorrect value of page pool info when queried by debugfs
Currently, when user queries page pool info by debugfs command
"cat page_pool_info", the cnt of allocated page for page pool may be
incorrect because of memory inconsistency problem caused by compiler
optimization.
So this patch uses READ_ONCE() to read value of pages_state_hold_cnt to
fix this problem.
Fixes: f9b81c4439c4 ("net: hns3: debugfs add support dumping page pool info") Signed-off-by: Hao Chen <chenhao288@hisilicon.com> Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Guangbin Huang [Fri, 26 Nov 2021 12:03:15 +0000 (20:03 +0800)]
net: hns3: fix VF RSS failed problem after PF enable multi-TCs
When PF is set to multi-TCs and configured mapping relationship between
priorities and TCs, the hardware will active these settings for this PF
and its VFs.
In this case when VF just uses one TC and its rx packets contain priority,
and if the priority is not mapped to TC0, as other TCs of VF is not valid,
hardware always put this kind of packets to the queue 0. It cause this kind
of packets of VF can not be used RSS function.
To fix this problem, set tc mode of all unused TCs of VF to the setting of
TC0, then rx packet with priority which map to unused TC will be direct to
TC0.
Tony Lu [Fri, 26 Nov 2021 02:41:35 +0000 (10:41 +0800)]
net/smc: Don't call clcsock shutdown twice when smc shutdown
When applications call shutdown() with SHUT_RDWR in userspace,
smc_close_active() calls kernel_sock_shutdown(), and it is called
twice in smc_shutdown().
This fixes this by checking sk_state before do clcsock shutdown, and
avoids missing the application's call of smc_shutdown().
Ziyang Xuan [Fri, 26 Nov 2021 01:59:42 +0000 (09:59 +0800)]
net: vlan: fix underflow for the real_dev refcnt
Inject error before dev_hold(real_dev) in register_vlan_dev(),
and execute the following testcase:
ip link add dev dummy1 type dummy
ip link add name dummy1.100 link dummy1 type vlan id 100
ip link del dev dummy1
When the dummy netdevice is removed, we will get a WARNING as following:
=======================================================================
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 2 PID: 0 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0
and an endless loop of:
=======================================================================
unregister_netdevice: waiting for dummy1 to become free. Usage count = -1073741824
That is because dev_put(real_dev) in vlan_dev_free() be called without
dev_hold(real_dev) in register_vlan_dev(). It makes the refcnt of real_dev
underflow.
Move the dev_hold(real_dev) to vlan_dev_init() which is the call-back of
ndo_init(). That makes dev_hold() and dev_put() for vlan's real_dev
symmetrical.
Fixes: 52f553199741 ("net: vlan: fix a UAF in vlan_dev_real_dev()") Reported-by: Petr Machata <petrm@nvidia.com> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Link: https://lore.kernel.org/r/20211126015942.2918542-1-william.xuanziyang@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Julian Wiedmann [Fri, 26 Nov 2021 17:55:43 +0000 (18:55 +0100)]
ethtool: ioctl: fix potential NULL deref in ethtool_set_coalesce()
ethtool_set_coalesce() now uses both the .get_coalesce() and
.set_coalesce() callbacks. But the check for their availability is
buggy, so changing the coalesce settings on a device where the driver
provides only _one_ of the callbacks results in a NULL pointer
dereference instead of an -EOPNOTSUPP.
Fix the condition so that the availability of both callbacks is
ensured. This also matches the netlink code.
Note that reproducing this requires some effort - it only affects the
legacy ioctl path, and needs a specific combination of driver options:
- have .get_coalesce() and .coalesce_supported but no
.set_coalesce(), or
- have .set_coalesce() but no .get_coalesce(). Here eg. ethtool doesn't
cause the crash as it first attempts to call ethtool_get_coalesce()
and bails out on error.
Fixes: 8c9f3b32dcfc ("ethtool: extend coalesce setting uAPI with CQE mode") Cc: Yufeng Mo <moyufeng@huawei.com> Cc: Huazhong Tan <tanhuazhong@huawei.com> Cc: Andrew Lunn <andrew@lunn.ch> Cc: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Link: https://lore.kernel.org/r/20211126175543.28000-1-jwi@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Davide Caratti [Wed, 24 Nov 2021 16:14:40 +0000 (17:14 +0100)]
net/sched: sch_ets: don't peek at classes beyond 'nbands'
when the number of DRR classes decreases, the round-robin active list can
contain elements that have already been freed in ets_qdisc_change(). As a
consequence, it's possible to see a NULL dereference crash, caused by the
attempt to call cl->qdisc->ops->peek(cl->qdisc) when cl->qdisc is NULL:
v3: fix race between ets_qdisc_change() and ets_qdisc_dequeue() delisting
DRR classes beyond 'nbands' in ets_qdisc_change() with the qdisc lock
acquired, thanks to Cong Wang.
v2: when a NULL qdisc is found in the DRR active list, try to dequeue skb
from the next list item.
Yannick Vignon [Wed, 24 Nov 2021 15:47:31 +0000 (16:47 +0100)]
net: stmmac: Disable Tx queues when reconfiguring the interface
The Tx queues were not disabled in situations where the driver needed to
stop the interface to apply a new configuration. This could result in a
kernel panic when doing any of the 3 following actions:
* reconfiguring the number of queues (ethtool -L)
* reconfiguring the size of the ring buffers (ethtool -G)
* installing/removing an XDP program (ip l set dev ethX xdp)
Prevent the panic by making sure netif_tx_disable is called when stopping
an interface.
Without this patch, the following kernel panic can be observed when doing
any of the actions above:
Unable to handle kernel paging request at virtual address ffff80001238d040
[....]
Call trace:
dwmac4_set_addr+0x8/0x10
dev_hard_start_xmit+0xe4/0x1ac
sch_direct_xmit+0xe8/0x39c
__dev_queue_xmit+0x3ec/0xaf0
dev_queue_xmit+0x14/0x20
[...]
[ end trace 0000000000000002 ]---
Fixes: f43adc39cc614 ("net: stmmac: Add initial XDP support") Fixes: 53da2eeb88818 ("net: stmmac: Add support to Ethtool get/set ring parameters") Fixes: 229a0e8918f93 ("net: stmmac: add ethtool support for get/set channels") Signed-off-by: Yannick Vignon <yannick.vignon@nxp.com> Link: https://lore.kernel.org/r/20211124154731.1676949-1-yannick.vignon@oss.nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Fri, 26 Nov 2021 18:27:43 +0000 (10:27 -0800)]
Merge tag 'staging-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging fixes from Greg KH:
"Here are some small staging driver fixes and one driver removal for
5.16-rc3.
The fixes resolve a number of small issues found in 5.16-rc1, nothing
huge at all. The driver removal was due to a platform being removed in
5.16-rc1, but this driver was forgotten about. It wasn't being built
anymore so it's safe to delete.
All have been in linux-next for a while with no reported problems"
* tag 'staging-5.16-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: rtl8192e: Fix use after free in _rtl92e_pci_disconnect()
staging: greybus: Add missing rwsem around snd_ctl_remove() calls
staging: Remove Netlogic XLP network driver
staging: r8188eu: fix a memory leak in rtw_wx_read32()
staging: r8188eu: use GFP_ATOMIC under spinlock
staging: r8188eu: Use kzalloc() with GFP_ATOMIC in atomic context
staging/fbtft: Fix backlight
staging: r8188eu: Fix breakage introduced when 5G code was removed