Linus Torvalds [Tue, 2 Feb 2021 18:26:09 +0000 (10:26 -0800)]
Merge tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Networking fixes for 5.11-rc7, including fixes from bpf and mac80211
trees.
Current release - regressions:
- ip_tunnel: fix mtu calculation
- mlx5: fix function calculation for page trees
Previous releases - regressions:
- vsock: fix the race conditions in multi-transport support
- neighbour: prevent a dead entry from updating gc_list
- dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Previous releases - always broken:
- bpf, cgroup: two copy_{from,to}_user() warn_on_once splats for BPF
cgroup getsockopt infra when user space is trying to race against
optlen, from Loris Reiff.
- mac80211: fix station rate table updates on assoc
- r8169: work around RTL8125 UDP HW bug
- igc: report speed and duplex as unknown when device is runtime
suspended
- rxrpc: fix deadlock around release of dst cached on udp tunnel"
* tag 'net-5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (36 commits)
net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
net: ipa: fix two format specifier errors
net: ipa: use the right accessor in ipa_endpoint_status_skip()
net: ipa: be explicit about endianness
net: ipa: add a missing __iomem attribute
net: ipa: pass correct dma_handle to dma_free_coherent()
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
net: mvpp2: TCAM entry enable should be written after SRAM data
net: lapb: Copy the skb before sending a packet
net/mlx5e: Release skb in case of failure in tc update skb
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5: Fix leak upon failure of rule creation
net/mlx5: Fix function calculation for page trees
docs: networking: swap words in icmp_errors_use_inbound_ifaddr doc
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
net: ip_tunnel: fix mtu calculation
vsock: fix the race conditions in multi-transport support
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
ibmvnic: device remove has higher precedence over reset
...
Andreas Oetken [Tue, 2 Feb 2021 09:03:04 +0000 (10:03 +0100)]
net: hsr: align sup_multicast_addr in struct hsr_priv to u16 boundary
sup_multicast_addr is passed to ether_addr_equal for address comparison
which casts the address inputs to u16 leading to an unaligned access.
Aligning the sup_multicast_addr to u16 boundary fixes the issue.
Jakub Kicinski [Tue, 2 Feb 2021 16:51:25 +0000 (08:51 -0800)]
Merge tag 'mlx5-fixes-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2021-02-01
Please note the first patch in this series
("Fix function calculation for page trees") is fixing a regression
due to previous fix in net which you didn't include in your previous
rc pr. So I hope this series will make it into your next rc pr,
so mlx5 won't be broken in the next rc.
* tag 'mlx5-fixes-2021-02-01' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: Release skb in case of failure in tc update skb
net/mlx5e: Update max_opened_tc also when channels are closed
net/mlx5: Fix leak upon failure of rule creation
net/mlx5: Fix function calculation for page trees
====================
Jakub Kicinski [Tue, 2 Feb 2021 16:48:17 +0000 (08:48 -0800)]
Merge branch 'net-ipa-a-few-bug-fixes'
Alex Elder says:
====================
net: ipa: a few bug fixes
This series fixes four minor bugs. The first two are things that
sparse points out. All four are very simple and each patch should
explain itself pretty well.
====================
Alex Elder [Mon, 1 Feb 2021 23:26:08 +0000 (17:26 -0600)]
net: ipa: use the right accessor in ipa_endpoint_status_skip()
When extracting the destination endpoint ID from the status in
ipa_endpoint_status_skip(), u32_get_bits() is used. This happens to
work, but it's wrong: the structure field is only 8 bits wide
instead of 32.
Fix this by using u8_get_bits() to get the destination endpoint ID.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alex Elder [Mon, 1 Feb 2021 23:26:07 +0000 (17:26 -0600)]
net: ipa: be explicit about endianness
Sparse warns that the assignment of the metadata mask for a QMAP
endpoint in ipa_endpoint_init_hdr_metadata_mask() is a bad
assignment. We know we want the mask value to be big endian, even
though the value we write is in host byte order. Use a __force
tag to indicate we really mean it.
Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Mon, 1 Feb 2021 20:50:56 +0000 (21:50 +0100)]
r8169: fix WoL on shutdown if CONFIG_DEBUG_SHIRQ is set
So far phy_disconnect() is called before free_irq(). If CONFIG_DEBUG_SHIRQ
is set and interrupt is shared, then free_irq() creates an "artificial"
interrupt by calling the interrupt handler. The "link change" flag is set
in the interrupt status register, causing phylib to eventually call
phy_suspend(). Because the net_device is detached from the PHY already,
the PHY driver can't recognize that WoL is configured and powers down the
PHY.
net/rds: restrict iovecs length for RDS_CMSG_RDMA_ARGS
syzbot found WARNING in rds_rdma_extra_size [1] when RDS_CMSG_RDMA_ARGS
control message is passed with user-controlled
0x40001 bytes of args->nr_local, causing order >= MAX_ORDER condition.
The exact value 0x40001 can be checked with UIO_MAXIOV which is 0x400.
So for kcalloc() 0x400 iovecs with sizeof(struct rds_iovec) = 0x10
is the closest limit, with 0x10 leftover.
Same condition is currently done in rds_cmsg_rdma_args().
Xie He [Mon, 1 Feb 2021 05:57:06 +0000 (21:57 -0800)]
net: lapb: Copy the skb before sending a packet
When sending a packet, we will prepend it with an LAPB header.
This modifies the shared parts of a cloned skb, so we should copy the
skb rather than just clone it, before we prepend the header.
In "Documentation/networking/driver.rst" (the 2nd point), it states
that drivers shouldn't modify the shared parts of a cloned skb when
transmitting.
The "dev_queue_xmit_nit" function in "net/core/dev.c", which is called
when an skb is being sent, clones the skb and sents the clone to
AF_PACKET sockets. Because the LAPB drivers first remove a 1-byte
pseudo-header before handing over the skb to us, if we don't copy the
skb before prepending the LAPB header, the first byte of the packets
received on AF_PACKET sockets can be corrupted.
Jakub Kicinski [Tue, 2 Feb 2021 16:37:00 +0000 (08:37 -0800)]
Merge tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
Johannes Berg says:
====================
Two fixes:
- station rate tables were not updated correctly
after association, leading to bad configuration
- rtl8723bs (staging) was initializing data incorrectly
after the previous fix and needed to move the init
later
* tag 'mac80211-for-net-2021-02-02' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211:
staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
mac80211: fix station rate table updates on assoc
====================
net/mlx5e: Update max_opened_tc also when channels are closed
max_opened_tc is used for stats, so that potentially non-zero stats
won't disappear when num_tc decreases. However, mlx5e_setup_tc_mqprio
fails to update it in the flow where channels are closed.
This commit fixes it. The new value of priv->channels.params.num_tc is
always checked on exit. In case of errors it will just be the old value,
and in case of success it will be the updated value.
Daniel Jurgens [Mon, 1 Feb 2021 16:11:10 +0000 (18:11 +0200)]
net/mlx5: Fix function calculation for page trees
The function calculation always results in a value of 0. This works
generally, but when the release all pages feature is enabled it will
result in crashes.
Fixes: 9bb0189d6ae5 ("net/mlx5: Maintain separate page trees for ECPF and PF functions") Signed-off-by: Daniel Jurgens <danielj@nvidia.com> Reported-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Jakub Kicinski [Tue, 2 Feb 2021 04:23:44 +0000 (20:23 -0800)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2021-02-01
This series contains updates to igc and i40e drivers.
Kai-Heng Feng fixes igc to report unknown speed and duplex during suspend
as an attempted read will cause errors.
Kevin Lo sets the default value to -IGC_ERR_NVM instead of success for
writing shadow RAM as this could miss a timeout. Also propagates the return
value for Flow Control configuration to properly pass on errors for igc.
Aleksandr reverts commit 346f948a5385 ("i40e: don't report link up for a VF
who hasn't enabled queues") as this can cause link flapping.
* '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue:
i40e: Revert "i40e: don't report link up for a VF who hasn't enabled queues"
igc: check return value of ret_val in igc_config_fc_after_link_up
igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
igc: Report speed and duplex as unknown when device is runtime suspended
====================
rmb() can be removed since:
1. pending_scrq() has dma_rmb() at the function end;
2. dma_rmb(), though weaker, is enough here.
Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Acked-by: Dwip Banerjee <dnbanerg@us.ibm.com> Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com> Reviewed-by: Brian King <brking@linux.vnet.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Lijun Pan [Sat, 30 Jan 2021 01:19:04 +0000 (19:19 -0600)]
ibmvnic: rework to ensure SCRQ entry reads are properly ordered
Move the dma_rmb() between pending_scrq() and ibmvnic_next_scrq()
into the end of pending_scrq() to save the duplicated code since
this dma_rmb will be used 3 times.
Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Acked-by: Thomas Falcon <tlfalcon@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Dongseok Yi [Fri, 29 Jan 2021 23:13:27 +0000 (08:13 +0900)]
udp: ipv4: manipulate network header of NATed UDP GRO fraglist
UDP/IP header of UDP GROed frag_skbs are not updated even after NAT
forwarding. Only the header of head_skb from ip_finish_output_gso ->
skb_gso_segment is updated but following frag_skbs are not updated.
A call path skb_mac_gso_segment -> inet_gso_segment ->
udp4_ufo_fragment -> __udp_gso_segment -> __udp_gso_segment_list
does not try to update UDP/IP header of the segment list but copy
only the MAC header.
Update port, addr and check of each skb of the segment list in
__udp_gso_segment_list. It covers both SNAT and DNAT.
Vadim Fedorenko [Fri, 29 Jan 2021 22:27:47 +0000 (01:27 +0300)]
net: ip_tunnel: fix mtu calculation
dev->hard_header_len for tunnel interface is set only when header_ops
are set too and already contains full overhead of any tunnel encapsulation.
That's why there is not need to use this overhead twice in mtu calc.
Fixes: c5ed160e7b57 ("ip_gre: set dev->hard_header_len and dev->needed_headroom properly") Reported-by: Slava Bacherikov <mail@slava.cc> Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/1611959267-20536-1-git-send-email-vfedorenko@novek.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Alexander Popov [Mon, 1 Feb 2021 08:47:19 +0000 (11:47 +0300)]
vsock: fix the race conditions in multi-transport support
There are multiple similar bugs implicitly introduced by the
commit e29ab509ea9a91e6 ("vsock: add multi-transports support") and
commit 6373aba309c5f8ff ("vsock: prevent transport modules unloading").
The bug pattern:
[1] vsock_sock.transport pointer is copied to a local variable,
[2] lock_sock() is called,
[3] the local variable is used.
VSOCK multi-transport support introduced the race condition:
vsock_sock.transport value may change between [1] and [2].
Let's copy vsock_sock.transport pointer to local variables after
the lock_sock() call.
Jakub Kicinski [Tue, 2 Feb 2021 02:50:12 +0000 (18:50 -0800)]
Merge tag 'mlx5-dr-2021-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-dr-2021-01-29
Add support for Connect-X6DX Software steering
This series adds SW Steering support for Connect-X6DX.
Since the STE and actions formats are different on this new HW,
we implemented the HW specific STEv1 layer on the infrastructure
implemented in previous mlx5 DR patchset to support all the
functionalities as previous devices.
Most of the code in this series very is low level HW specific, we
implement the function pointers for the generic SW steering layer.
* tag 'mlx5-dr-2021-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5: DR, Allow SW steering for sw_owner_v2 devices
net/mlx5: DR, Copy all 64B whenever replacing STE in the head of miss-list
net/mlx5: DR, Use HW specific logic API when writing STE
net/mlx5: DR, Use the right size when writing partial STE into HW
net/mlx5: DR, Add STEv1 modify header logic
net/mlx5: DR, Add STEv1 action apply logic
net/mlx5: DR, Add STEv1 setters and getters
net/mlx5: DR, Allow native protocol support for HW STEv1
net/mlx5: DR, Add HW STEv1 match logic
net/mlx5: DR, Add match STEv1 structs to ifc
net/mlx5: DR, Fix potential shift wrapping of 32-bit value
====================
net: sched: replaced invalid qdisc tree flush helper in qdisc_replace
Commit a9ee80847d0d ("net: sched: introduce and use qdisc tree flush/purge helpers")
introduced qdisc tree flush/purge helpers, but erroneously used flush helper
instead of purge helper in qdisc_replace function.
This issue was found in our CI, that tests various qdisc setups by configuring
qdisc and sending data through it. Call of invalid helper sporadically leads
to corruption of vt_tree/cf_tree of hfsc_class that causes kernel oops:
Lijun Pan [Fri, 29 Jan 2021 04:34:01 +0000 (22:34 -0600)]
ibmvnic: device remove has higher precedence over reset
Returning -EBUSY in ibmvnic_remove() does not actually hold the
removal procedure since driver core doesn't care for the return
value (see __device_release_driver() in drivers/base/dd.c
calling dev->bus->remove()) though vio_bus_remove
(in arch/powerpc/platforms/pseries/vio.c) records the
return value and passes it on. [1]
During the device removal precedure, checking for resetting
bit is dropped so that we can continue executing all the
cleanup calls in the rest of the remove function. Otherwise,
it can cause latent memory leaks and kernel crashes.
[1] https://lore.kernel.org/linuxppc-dev/20210117101242.dpwayq6wdgfdzirl@pengutronix.de/T/#m48f5befd96bc9842ece2a3ad14f4c27747206a53 Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Fixes: 76c528960bab ("ibmvnic: Do not process device remove during device reset") Signed-off-by: Lijun Pan <ljp@linux.ibm.com> Link: https://lore.kernel.org/r/20210129043402.95744-1-ljp@linux.ibm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Florian, Andrew and Vladimir suggested at some point to use devlink for
reporting tables, features and debugging counters instead of using debugfs and
printk.
So, start by reporting the VLAN and FDB table sizes.
====================
DENG Qingfang [Sat, 30 Jan 2021 13:43:34 +0000 (21:43 +0800)]
net: dsa: mv88e6xxx: override existent unicast portvec in port_fdb_add
Having multiple destination ports for a unicast address does not make
sense.
Make port_db_load_purge override existent unicast portvec instead of
adding a new port bit.
Fixes: a943b2a0b2ac ("net: dsa: mv88e6xxx: handle multiple ports in ATU") Signed-off-by: DENG Qingfang <dqfext@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Link: https://lore.kernel.org/r/20210130134334.10243-1-dqfext@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
VF queues were not brought up when PF was brought up after being
downed if the VF driver disabled VFs queues during PF down.
This could happen in some older or external VF driver implementations.
The problem was that PF driver used vf->queues_enabled as a condition
to decide what link-state it would send out which caused the issue.
Remove the check for vf->queues_enabled in the VF link notify.
Now VF will always be notified of the current link status.
Also remove the queues_enabled member from i40e_vf structure as it is
not used anymore. Otherwise VNF implementation was broken and caused
a link flap.
The original commit was a workaround to avoid breaking existing VFs though
it's really a fault of the VF code not the PF. The commit should be safe to
revert as all of the VFs we know of have been fixed. Also, since we now
know there is a related bug in the workaround, removing it is preferred.
Fixes: 346f948a5385 ("i40e: don't report link up for a VF who hasn't enabled") Signed-off-by: Aleksandr Loktionov <aleksandr.loktionov@intel.com> Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com> Tested-by: Konrad Jankowski <konrad0.jankowski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Linus Torvalds [Mon, 1 Feb 2021 19:15:57 +0000 (11:15 -0800)]
Merge tag 'media/v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"The rockship rkisp1 driver will be promoted from staging in 5.11.
While not too late, do a few uAPI changes which are needed to better
support its functionalities"
* tag 'media/v5.11-3' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: rockchip: rkisp1: extend uapi array sizes
media: rockchip: rkisp1: carry ip version information
media: rockchip: rkisp1: reduce number of histogram grid elements in uapi
media: rkisp1: stats: mask the hist_bins values
media: rkisp1: stats: remove a wrong cast to u8
media: rkisp1: uapi: change hist_bins array type from __u16 to __u32
Hans de Goede [Mon, 1 Feb 2021 15:29:56 +0000 (16:29 +0100)]
staging: rtl8723bs: Move wiphy setup to after reading the regulatory settings from the chip
Commit 73c6fe37ed09 ("staging: rtl8723bs: fix wireless regulatory API
misuse") moved the wiphy_apply_custom_regulatory() call to earlier in the
driver's init-sequence, so that it gets called before wiphy_register().
But at this point in time the eFuses which code the regulatory-settings
for the chip have not been read by the driver yet, causing
_rtw_reg_apply_flags() to set the IEEE80211_CHAN_DISABLED flag on *all*
channels.
On the device where I initially tested the fix, a Jumper EZpad 7 tablet,
this does not cause any problems because shortly after init the
rtw_reg_notifier() gets called fixing things up. I guess this happens
into response to receiving a (broadcast) packet with regulatory info
from the access-point ?
But on another device with a RTL8723BS wifi chip, an Acer Switch 10E
(SW3-016), the rtw_reg_notifier() never gets called. I assume that some
fuse has been set on this device to ignore regulatory info received from
access-points.
This means that on the Acer the driver is stuck in a state with all
channels disabled, leading to non working Wifi.
We cannot move the wiphy_apply_custom_regulatory() call back, because
that call must be made before the wiphy_register() call.
Instead move the entire rtw_wdev_alloc() call to after the Efuses have
been read, fixing all channels being disabled in the initial channel-map.
Fixes: 73c6fe37ed09 ("staging: rtl8723bs: fix wireless regulatory API misuse") Cc: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Link: https://lore.kernel.org/r/20210201152956.370186-2-hdegoede@redhat.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Kevin Lo [Thu, 7 Jan 2021 06:10:38 +0000 (14:10 +0800)]
igc: check return value of ret_val in igc_config_fc_after_link_up
Check return value from ret_val to make error check actually work.
Fixes: 3a56c17c0297 ("igc: Add setup link functionality") Signed-off-by: Kevin Lo <kevlo@kevlo.org> Acked-by: Sasha Neftin <sasha.neftin@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kevin Lo [Sun, 20 Dec 2020 14:18:19 +0000 (22:18 +0800)]
igc: set the default return value to -IGC_ERR_NVM in igc_write_nvm_srwr
This patch sets the default return value to -IGC_ERR_NVM in
igc_write_nvm_srwr. Without this change it wouldn't lead to a shadow RAM
write EEWR timeout.
Fixes: d6c58def7f21 ("igc: Add NVM support") Signed-off-by: Kevin Lo <kevlo@kevlo.org> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Kai-Heng Feng [Wed, 2 Dec 2020 07:50:17 +0000 (15:50 +0800)]
igc: Report speed and duplex as unknown when device is runtime suspended
Similar to commit 778be3147ac3 ("igb: Report speed and duplex as unknown
when device is runtime suspended"), if we try to read speed and duplex
sysfs while the device is runtime suspended, igc will complain and
stops working:
The more generic approach will be wrap get_link_ksettings() with begin()
and complete() callbacks, and calls runtime resume and runtime suspend
routine respectively. However, igc is like igb, runtime resume routine
uses rtnl_lock() which upper ethtool layer also uses.
So to prevent a deadlock on rtnl, take a different approach, use
pm_runtime_suspended() to avoid reading register while device is runtime
suspended.
Felix Fietkau [Mon, 1 Feb 2021 08:33:24 +0000 (09:33 +0100)]
mac80211: fix station rate table updates on assoc
If the driver uses .sta_add, station entries are only uploaded after the sta
is in assoc state. Fix early station rate table updates by deferring them
until the sta has been uploaded.
Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Link: https://lore.kernel.org/r/20210201083324.3134-1-nbd@nbd.name
[use rcu_access_pointer() instead since we won't dereference here] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Linus Torvalds [Sun, 31 Jan 2021 19:48:12 +0000 (11:48 -0800)]
Merge tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fix from Borislav Petkov:
"A single fix for objtool to generate proper unwind info for newer
toolchains which do not generate section symbols anymore. And a
cleanup ontop.
This was originally going to go during the next merge window but
people can already trigger a build error with binutils-2.36 which
doesn't emit section symbols - something which objtool relies on - so
let's expedite it"
* tag 'x86_entry_for_v5.11_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/entry: Remove put_ret_addr_in_rdi THUNK macro argument
x86/entry: Emit a symbol for register restoring thunk
Linus Torvalds [Sun, 31 Jan 2021 19:40:57 +0000 (11:40 -0800)]
Merge tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer fix from Thomas Gleixner:
"A fix for handling advertised, but non-existent 146818 RTCs correctly.
With the recent UIP handling changes the time readout of non-existent
RTCs hangs forever as the read returns always 0xFF which means the UIP
bit is set.
Sanity check the RTC before registering by checking the RTC_VALID
register for correctness"
* tag 'timers-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
rtc: mc146818: Detect and handle broken RTCs
Linus Torvalds [Sun, 31 Jan 2021 19:39:32 +0000 (11:39 -0800)]
Merge tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull single stepping fix from Thomas Gleixner:
"A single fix for the single step reporting regression caused by
getting the condition wrong when moving SYSCALL_EMU away from TIF
flags"
[ There's apparently another problem too, fix pending ]
* tag 'core-urgent-2021-01-31' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
entry: Unbreak single step reporting behaviour
Linus Torvalds [Sun, 31 Jan 2021 19:37:43 +0000 (11:37 -0800)]
Merge tag 'powerpc-5.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One fix for a bug in our soft interrupt masking, which could lead to
interrupt replaying recursing, causing spurious interrupts.
Thanks to Nicholas Piggin"
* tag 'powerpc-5.11-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: prevent recursive replay_soft_interrupts causing superfluous interrupt
Linus Torvalds [Sun, 31 Jan 2021 19:19:12 +0000 (11:19 -0800)]
Merge tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client fixes from Trond Myklebust:
- SUNRPC: Handle 0 length opaque XDR object data properly
- Fix a layout segment leak in pnfs_layout_process()
- pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
- pNFS/NFSv4: Improve rejection of out-of-order layouts
- pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
* tag 'nfs-for-5.11-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
SUNRPC: Handle 0 length opaque XDR object data properly
SUNRPC: Move simple_get_bytes and simple_get_netobj into private header
pNFS/NFSv4: Improve rejection of out-of-order layouts
pNFS/NFSv4: Update the layout barrier when we schedule a layoutreturn
pNFS/NFSv4: Try to return invalid layout in pnfs_layout_process()
pNFS/NFSv4: Fix a layout segment leak in pnfs_layout_process()
Linus Walleij [Sat, 2 Jan 2021 23:15:10 +0000 (00:15 +0100)]
leds: rt8515: Add Richtek RT8515 LED driver
This adds a driver for the Richtek RT8515 dual channel
torch/flash white LED driver.
This LED driver is found in some mobile phones from
Samsung such as the GT-S7710 and GT-I8190.
A V4L interface is added.
We do not have a proper datasheet for the RT8515 but
it turns out that RT9387A has a public datasheet and
is essentially the same chip. We designed the driver
in accordance with this datasheet. The day someone
needs to drive a RT9387A this driver can probably
easily be augmented to handle that chip too.
Sakari Ailus, Pavel Machek and Andy Shevchenko helped
significantly in getting this driver right.
Cc: Sakari Ailus <sakari.ailus@iki.fi> Cc: newbytee@protonmail.com Cc: Stephan Gerhold <stephan@gerhold.net> Cc: linux-media@vger.kernel.org Cc: phone-devel@vger.kernel.org Reviewed-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Linus Walleij <linus.walleij@linaro.org> Signed-off-by: Pavel Machek <pavel@ucw.cz>
Andrea Righi [Wed, 25 Nov 2020 15:18:22 +0000 (16:18 +0100)]
leds: trigger: fix potential deadlock with libata
We have the following potential deadlock condition:
========================================================
WARNING: possible irq lock inversion dependency detected
5.10.0-rc2+ #25 Not tainted
--------------------------------------------------------
swapper/3/0 just changed the state of lock: ffff8880063bd618 (&host->lock){-...}-{2:2}, at: ata_bmdma_interrupt+0x27/0x200
but this lock took another, HARDIRQ-READ-unsafe lock in the past:
(&trig->leddev_list_lock){.+.?}-{2:2}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
This lockdep splat is reported after:
commit 6836e0c29878 ("locking: More accurate annotations for read_lock()")
To clarify:
- read-locks are recursive only in interrupt context (when
in_interrupt() returns true)
- after acquiring host->lock in CPU1, another cpu (i.e. CPU2) may call
write_lock(&trig->leddev_list_lock) that would be blocked by CPU0
that holds trig->leddev_list_lock in read-mode
- when CPU1 (ata_ac_complete()) tries to read-lock
trig->leddev_list_lock, it would be blocked by the write-lock waiter
on CPU2 (because we are not in interrupt context, so the read-lock is
not recursive)
- at this point if an interrupt happens on CPU0 and
ata_bmdma_interrupt() is executed it will try to acquire host->lock,
that is held by CPU1, that is currently blocked by CPU2, so:
* CPU0 blocked by CPU1
* CPU1 blocked by CPU2
* CPU2 blocked by CPU0
*** DEADLOCK ***
The deadlock scenario is better represented by the following schema
(thanks to Boqun Feng <boqun.feng@gmail.com> for the schema and the
detailed explanation of the deadlock condition):
CPU 0: CPU 1: CPU 2:
----- ----- -----
led_trigger_event():
read_lock(&trig->leddev_list_lock);
<workqueue>
ata_hsm_qc_complete():
spin_lock_irqsave(&host->lock);
write_lock(&trig->leddev_list_lock);
ata_port_freeze():
ata_do_link_abort():
ata_qc_complete():
ledtrig_disk_activity():
led_trigger_blink_oneshot():
read_lock(&trig->leddev_list_lock);
// ^ not in in_interrupt() context, so could get blocked by CPU 2
<interrupt>
ata_bmdma_interrupt():
spin_lock_irqsave(&host->lock);
Fix by using read_lock_irqsave/irqrestore() in led_trigger_event(), so
that no interrupt can happen in between, preventing the deadlock
condition.
Apply the same change to led_trigger_blink_setup() as well, since the
same deadlock scenario can also happen in power_supply_update_bat_leds()
-> led_trigger_blink() -> led_trigger_blink_setup() (workqueue context),
and potentially prevent other similar usages.
Linus Torvalds [Sun, 31 Jan 2021 01:51:06 +0000 (17:51 -0800)]
Merge tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"Four cifs patches found in additional testing of the conversion to the
new mount API: three small option processing ones, and one fixing domain
based DFS referrals"
* tag '5.11-rc5-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: fix dfs domain referrals
cifs: returning mount parm processing errors correctly
cifs: fix mounts to subdirectories of target
cifs: ignore auto and noauto options if given
Linus Torvalds [Sun, 31 Jan 2021 01:42:42 +0000 (17:42 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Two minor fixes in drivers. Both changing strings (one in a comment,
one in a module help text) with no code impact"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: qla2xxx: Fix description for parameter ql2xenforce_iocb_limit
scsi: target: iscsi: Fix typo in comment
Linus Torvalds [Sat, 30 Jan 2021 19:48:57 +0000 (11:48 -0800)]
Merge tag 's390-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Vasily Gorbik:
- Fix max number of VCPUs reported via ultravisor information sysfs
interface.
- Fix memory leaks during vfio-ap resources clean up on KVM pointer
invalidation notification.
- Fix potential specification exception by avoiding unnecessary
interrupts disable after queue reset in vfio-ap.
* tag 's390-5.11-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: uv: Fix sysfs max number of VCPUs reporting
s390/vfio-ap: No need to disable IRQ after queue reset
s390/vfio-ap: clean up vfio_ap resources when KVM pointer invalidated
Chinmay Agarwal [Wed, 27 Jan 2021 16:54:54 +0000 (22:24 +0530)]
neighbour: Prevent a dead entry from updating gc_list
Following race condition was detected:
<CPU A, t0> - neigh_flush_dev() is under execution and calls
neigh_mark_dead(n) marking the neighbour entry 'n' as dead.
<CPU B, t1> - Executing: __netif_receive_skb() ->
__netif_receive_skb_core() -> arp_rcv() -> arp_process().arp_process()
calls __neigh_lookup() which takes a reference on neighbour entry 'n'.
<CPU A, t2> - Moves further along neigh_flush_dev() and calls
neigh_cleanup_and_release(n), but since reference count increased in t2,
'n' couldn't be destroyed.
<CPU B, t3> - Moves further along, arp_process() and calls
neigh_update()-> __neigh_update() -> neigh_update_gc_list(), which adds
the neighbour entry back in gc_list(neigh_mark_dead(), removed it
earlier in t0 from gc_list)
This leads to 'n' still being part of gc_list, but the actual
neighbour structure has been freed.
The situation can be prevented from happening if we disallow a dead
entry to have any possibility of updating gc_list. This is what the
patch intends to achieve.
Fixes: 6547d8e2146f ("neighbor: Fix locking order for gc_list changes") Signed-off-by: Chinmay Agarwal <chinagar@codeaurora.org> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20210127165453.GA20514@chinagar-linux.qualcomm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Neal Cardwell [Fri, 29 Jan 2021 18:54:38 +0000 (13:54 -0500)]
tcp: shrink inet_connection_sock icsk_mtup enabled and probe_size
This commit shrinks inet_connection_sock by 4 bytes, by shrinking
icsk_mtup.enabled from 32 bits to 1 bit, and shrinking
icsk_mtup.probe_size from s32 to an unsuigned 31 bit field.
This is to save space to compensate for the recent introduction of a
new u32 in inet_connection_sock, icsk_probes_tstamp, in the recent bug
fix commit 8a248b911650 ("tcp: fix TCP_USER_TIMEOUT with zero window").
This should not change functionality, since icsk_mtup.enabled is only
ever set to 0 or 1, and icsk_mtup.probe_size can only be either 0
or a positive MTU value returned by tcp_mss_to_mtu()
David Howells [Fri, 29 Jan 2021 23:53:50 +0000 (23:53 +0000)]
rxrpc: Fix deadlock around release of dst cached on udp tunnel
AF_RXRPC sockets use UDP ports in encap mode. This causes socket and dst
from an incoming packet to get stolen and attached to the UDP socket from
whence it is leaked when that socket is closed.
When a network namespace is removed, the wait for dst records to be cleaned
up happens before the cleanup of the rxrpc and UDP socket, meaning that the
wait never finishes.
Fix this by moving the rxrpc (and, by dependence, the afs) private
per-network namespace registrations to the device group rather than subsys
group. This allows cached rxrpc local endpoints to be cleared and their
UDP sockets closed before we try waiting for the dst records.
The symptom is that lines looking like the following:
unregister_netdevice: waiting for lo to become free
get emitted at regular intervals after running something like the
referenced syzbot test.
Thanks to Vadim for tracking this down and work out the fix.
Reported-by: syzbot+df400f2f24a1677cd7e0@syzkaller.appspotmail.com Reported-by: Vadim Fedorenko <vfedorenko@novek.ru> Fixes: 912123eeaf62 ("rxrpc: Use the UDP encap_rcv hook") Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Vadim Fedorenko <vfedorenko@novek.ru> Link: https://lore.kernel.org/r/161196443016.3868642.5577440140646403533.stgit@warthog.procyon.org.uk Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: bridge: drop hosts limit sysfs and add a comment
As recently discussed[1] we should stop extending the bridge sysfs
support for new options and move to using netlink only, so patch 01
drops the recently added hosts limit sysfs support which is still in
net-next only and patch 02 adds comments in br_sysfs_br/if.c to warn
against adding new sysfs options.
Jakub Kicinski [Sat, 30 Jan 2021 05:25:27 +0000 (21:25 -0800)]
Merge branch 'tag_8021q-for-ocelot-switches'
Vladimir Oltean says:
====================
tag_8021q for Ocelot switches
The Felix switch inside LS1028A has an issue. It has a 2.5G CPU port,
and the external ports, in the majority of use cases, run at 1G. This
means that, when the CPU injects traffic into the switch, it is very
easy to run into congestion. This is not to say that it is impossible to
enter congestion even with all ports running at the same speed, just
that the default configuration is already very prone to that by design.
Normally, the way to deal with that is using Ethernet flow control
(PAUSE frames).
However, this functionality is not working today with the ENETC - Felix
switch pair. The hardware issue is undergoing documentation right now as
an erratum within NXP, but several customers have been requesting a
reasonable workaround for it.
In truth, the LS1028A has 2 internal port pairs. The lack of flow control
is an issue only when NPI mode (Node Processor Interface, aka the mode
where the "CPU port module", which carries DSA-style tagged packets, is
connected to a regular Ethernet port) is used, and NPI mode is supported
by Felix on a single port.
In past BSPs, we have had setups where both internal port pairs were
enabled. We were advertising the following setup:
"data port" "control port"
(2.5G) (1G)
eno2 eno3
^ ^
| |
| regular | DSA-tagged
| frames | frames
| |
v v
swp4 swp5
This works but is highly unpractical, due to NXP shifting the task of
designing a functional system (choosing which port to use, depending on
type of traffic required) up to the end user. The swpN interfaces would
have to be bridged with swp4, in order for the eno2 "data port" to have
access to the outside network. And the swpN interfaces would still be
capable of IP networking. So running a DHCP client would give us two IP
interfaces from the same subnet, one assigned to eno2, and the other to
swpN (0, 1, 2, 3).
Also, the dual port design doesn't scale. When attaching another DSA
switch to a Felix port, the end result is that the "data port" cannot
carry any meaningful data to the external world, since it lacks the DSA
tags required to traverse the sja1105 switches below. All that traffic
needs to go through the "control port".
So in newer BSPs there was a desire to simplify that setup, and only
have one internal port pair:
eno2 eno3
^
|
| DSA-tagged x disabled
| frames
|
v
swp4 swp5
However, this setup only exacerbates the issue of not having flow
control on the NPI port, since that is the only port now. Also, there
are use cases that still require the "data port", such as IEEE 802.1CB
(TSN stream identification doesn't work over an NPI port), source
MAC address learning over NPI, etc.
Again, there is a desire to keep the simplicity of the single internal
port setup, while regaining the benefits of having a dedicated data port
as well. And this series attempts to deliver just that.
So the NPI functionality is disabled conditionally. Its purpose was:
- To ensure individually addressable ports on TX. This can be replaced
by using some designated VLAN tags which are pushed by the DSA tagger
code, then removed by the switch (so they are invisible to the outside
world and to the user).
- To ensure source port identification on RX. Again, this can be
replaced by using some designated VLAN tags to encapsulate all RX
traffic (each VLAN uniquely identifies a source port). The DSA tagger
determines which port it was based on the VLAN number, then removes
that header.
- To deliver PTP timestamps. This cannot be obtained through VLAN
headers, so we need to take a step back and see how else we can do
that. The Microchip Ocelot-1 (VSC7514 MIPS) driver performs manual
injection/extraction from the CPU port module using register-based
MMIO, and not over Ethernet. We will need to do the same from DSA,
which makes this tagger a sort of hybrid between DSA and pure
switchdev.
====================
Vladimir Oltean [Fri, 29 Jan 2021 01:00:09 +0000 (03:00 +0200)]
net: dsa: felix: perform switch setup for tag_8021q
Unlike sja1105, the only other user of the software-defined tag_8021q.c
tagger format, the implementation we choose for the Felix DSA switch
driver preserves full functionality under a vlan_filtering bridge
(i.e. IP termination works through the DSA user ports under all
circumstances).
The tag_8021q protocol just wants:
- Identifying the ingress switch port based on the RX VLAN ID, as seen
by the CPU. We achieve this by using the TCAM engines (which are also
used for tc-flower offload) to push the RX VLAN as a second, outer
tag, on egress towards the CPU port.
- Steering traffic injected into the switch from the network stack
towards the correct front port based on the TX VLAN, and consuming
(popping) that header on the switch's egress.
A tc-flower pseudocode of the static configuration done by the driver
would look like this:
$ tc qdisc add dev <cpu-port> clsact
$ for eth in swp0 swp1 swp2 swp3; do \
tc filter add dev <cpu-port> egress flower indev ${eth} \
action vlan push id <rxvlan> protocol 802.1ad; \
tc filter add dev <cpu-port> ingress protocol 802.1Q flower
vlan_id <txvlan> action vlan pop \
action mirred egress redirect dev ${eth}; \
done
but of course since DSA does not register network interfaces for the CPU
port, this configuration would be impossible for the user to do. Also,
due to the same reason, it is impossible for the user to inadvertently
delete these rules using tc. These rules do not collide in any way with
tc-flower, they just consume some TCAM space, which is something we can
live with.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:08 +0000 (03:00 +0200)]
net: dsa: add a second tagger for Ocelot switches based on tag_8021q
There are use cases for which the existing tagger, based on the NPI
(Node Processor Interface) functionality, is insufficient.
Namely:
- Frames injected through the NPI port bypass the frame analyzer, so no
source address learning is performed, no TSN stream classification,
etc.
- Flow control is not functional over an NPI port (PAUSE frames are
encapsulated in the same Extraction Frame Header as all other frames)
- There can be at most one NPI port configured for an Ocelot switch. But
in NXP LS1028A and T1040 there are two Ethernet CPU ports. The non-NPI
port is currently either disabled, or operated as a plain user port
(albeit an internally-facing one). Having the ability to configure the
two CPU ports symmetrically could pave the way for e.g. creating a LAG
between them, to increase bandwidth seamlessly for the system.
So there is a desire to have an alternative to the NPI mode. This change
keeps the default tagger for the Seville and Felix switches as "ocelot",
but it can be changed via the following device attribute:
Vladimir Oltean [Fri, 29 Jan 2021 01:00:07 +0000 (03:00 +0200)]
net: dsa: felix: convert to the new .change_tag_protocol DSA API
In expectation of the new tag_ocelot_8021q tagger implementation, we
need to be able to do runtime switchover between one tagger and another.
So we must structure the existing code for the current NPI-based tagger
in a certain way.
We move the felix_npi_port_init function in expectation of the future
driver configuration necessary for tag_ocelot_8021q: we would like to
not have the NPI-related bits interspersed with the tag_8021q bits.
is perhaps non-trivial, but is nonetheless non-functional. The PGID_UC
(replicator for unknown unicast) is already configured out of hardware
reset to flood to all ports except ocelot->num_phys_ports (the CPU port
module). All we change is that we use a read-modify-write to only add
the CPU port module to the unknown unicast replicator, as opposed to
doing a full write to the register.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:06 +0000 (03:00 +0200)]
net: dsa: allow changing the tag protocol via the "tagging" device attribute
Currently DSA exposes the following sysfs:
$ cat /sys/class/net/eno2/dsa/tagging
ocelot
which is a read-only device attribute, introduced in the kernel as
commit e5869f4d54bc ("net: dsa: Expose tagging protocol to user-space"),
and used by libpcap since its commit 993db3800d7d ("Add support for DSA
link-layer types").
It would be nice if we could extend this device attribute by making it
writable:
$ echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
This is useful with DSA switches that can make use of more than one
tagging protocol. It may be useful in dsa_loop in the future too, to
perform offline testing of various taggers, or for changing between dsa
and edsa on Marvell switches, if that is desirable.
In terms of implementation, drivers can support this feature by
implementing .change_tag_protocol, which should always leave the switch
in a consistent state: either with the new protocol if things went well,
or with the old one if something failed. Teardown of the old protocol,
if necessary, must be handled by the driver.
Some things remain as before:
- The .get_tag_protocol is currently only called at probe time, to load
the initial tagging protocol driver. Nonetheless, new drivers should
report the tagging protocol in current use now.
- The driver should manage by itself the initial setup of tagging
protocol, no later than the .setup() method, as well as destroying
resources used by the last tagger in use, no earlier than the
.teardown() method.
For multi-switch DSA trees, error handling is a bit more complicated,
since e.g. the 5th out of 7 switches may fail to change the tag
protocol. When that happens, a revert to the original tag protocol is
attempted, but that may fail too, leaving the tree in an inconsistent
state despite each individual switch implementing .change_tag_protocol
transactionally. Since the intersection between drivers that implement
.change_tag_protocol and drivers that support D in DSA is currently the
empty set, the possibility for this error to happen is ignored for now.
Testing:
$ insmod mscc_felix.ko
[ 79.549784] mscc_felix 0000:00:00.5: Adding to iommu group 14
[ 79.565712] mscc_felix 0000:00:00.5: Failed to register DSA switch: -517
$ insmod tag_ocelot.ko
$ rmmod mscc_felix.ko
$ insmod mscc_felix.ko
[ 97.261724] libphy: VSC9959 internal MDIO bus: probed
[ 97.267363] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 0
[ 97.274998] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 1
[ 97.282561] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 2
[ 97.289700] mscc_felix 0000:00:00.5: Found PCS at internal MDIO address 3
[ 97.599163] mscc_felix 0000:00:00.5 swp0 (uninitialized): PHY [0000:00:00.3:10] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.862034] mscc_felix 0000:00:00.5 swp1 (uninitialized): PHY [0000:00:00.3:11] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 97.950731] mscc_felix 0000:00:00.5 swp0: configuring for inband/qsgmii link mode
[ 97.964278] 8021q: adding VLAN 0 to HW filter on device swp0
[ 98.146161] mscc_felix 0000:00:00.5 swp2 (uninitialized): PHY [0000:00:00.3:12] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.238649] mscc_felix 0000:00:00.5 swp1: configuring for inband/qsgmii link mode
[ 98.251845] 8021q: adding VLAN 0 to HW filter on device swp1
[ 98.433916] mscc_felix 0000:00:00.5 swp3 (uninitialized): PHY [0000:00:00.3:13] driver [Microsemi GE VSC8514 SyncE] (irq=POLL)
[ 98.485542] mscc_felix 0000:00:00.5: configuring for fixed/internal link mode
[ 98.503584] mscc_felix 0000:00:00.5: Link is Up - 2.5Gbps/Full - flow control rx/tx
[ 98.527948] device eno2 entered promiscuous mode
[ 98.544755] DSA: tree 0 setup
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=2.337 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.754 ms
^C
- 10.0.0.1 ping statistics -
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max = 0.754/1.545/2.337 ms
ip link set swp0 down
ip link set swp1 down
ip link set swp2 down
ip link set swp3 down
ip link set swp5 down
ip link set eno2 down
echo ocelot-8021q > /sys/class/net/eno2/dsa/tagging
ip link set eno2 up
ip link set swp0 up
ip link set swp1 up
ip link set swp2 up
ip link set swp3 up
ip link set swp5 up
$ ./test_ocelot_8021q.sh
./test_ocelot_8021q.sh: line 9: echo: write error: Protocol not available
$ rmmod tag_ocelot.ko
rmmod: can't unload module 'tag_ocelot': Resource temporarily unavailable
$ insmod tag_ocelot_8021q.ko
$ ./test_ocelot_8021q.sh
$ cat /sys/class/net/eno2/dsa/tagging
ocelot-8021q
$ rmmod tag_ocelot.ko
$ rmmod tag_ocelot_8021q.ko
rmmod: can't unload module 'tag_ocelot_8021q': Resource temporarily unavailable
$ ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
64 bytes from 10.0.0.1: seq=0 ttl=64 time=0.953 ms
64 bytes from 10.0.0.1: seq=1 ttl=64 time=0.787 ms
64 bytes from 10.0.0.1: seq=2 ttl=64 time=0.771 ms
$ rmmod mscc_felix.ko
[ 645.544426] mscc_felix 0000:00:00.5: Link is Down
[ 645.838608] DSA: tree 0 torn down
$ rmmod tag_ocelot_8021q.ko
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:05 +0000 (03:00 +0200)]
net: dsa: keep a copy of the tagging protocol in the DSA switch tree
Cascading DSA switches can be done multiple ways. There is the brute
force approach / tag stacking, where one upstream switch, located
between leaf switches and the host Ethernet controller, will just
happily transport the DSA header of those leaf switches as payload.
For this kind of setups, DSA works without any special kind of treatment
compared to a single switch - they just aren't aware of each other.
Then there's the approach where the upstream switch understands the tags
it transports from its leaves below, as it doesn't push a tag of its own,
but it routes based on the source port & switch id information present
in that tag (as opposed to DMAC & VID) and it strips the tag when
egressing a front-facing port. Currently only Marvell implements the
latter, and Marvell DSA trees contain only Marvell switches.
So it is safe to say that DSA trees already have a single tag protocol
shared by all switches, and in fact this is what makes the switches able
to understand each other. This fact is also implied by the fact that
currently, the tagging protocol is reported as part of a sysfs installed
on the DSA master and not per port, so it must be the same for all the
ports connected to that DSA master regardless of the switch that they
belong to.
It's time to make this official and enforce it (yes, this also means we
won't have any "switch understands tag to some extent but is not able to
speak it" hardware oddities that we'll support in the future).
This is needed due to the imminent introduction of the dsa_switch_ops::
change_tag_protocol driver API. When that is introduced, we'll have
to notify switches of the tagging protocol that they're configured to
use. Currently the tag_ops structure pointer is held only for CPU ports.
But there are switches which don't have CPU ports and nonetheless still
need to be configured. These would be Marvell leaf switches whose
upstream port is just a DSA link. How do we inform these of their
tagging protocol setup/deletion?
One answer to the above would be: iterate through the DSA switch tree's
ports once, list the CPU ports, get their tag_ops, then iterate again
now that we have it, and notify everybody of that tag_ops. But what to
do if conflicts appear between one cpu_dp->tag_ops and another? There's
no escaping the fact that conflict resolution needs to be done, so we
can be upfront about it.
Ease our work and just keep the master copy of the tag_ops inside the
struct dsa_switch_tree. Reference counting is now moved to be per-tree
too, instead of per-CPU port.
There are many places in the data path that access master->dsa_ptr->tag_ops
and we would introduce unnecessary performance penalty going through yet
another indirection, so keep those right where they are.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:04 +0000 (03:00 +0200)]
net: dsa: document the existing switch tree notifiers and add a new one
The existence of dsa_broadcast has generated some confusion in the past:
https://www.mail-archive.com/netdev@vger.kernel.org/msg365042.html
So let's document the existing dsa_port_notify and dsa_broadcast
functions and explain when each of them should be used.
Also, in fact, the in-between function has always been there but was
lacking a name, and is the main reason for this patch: dsa_tree_notify.
Refactor dsa_broadcast to use it.
This patch also moves dsa_broadcast (a top-level function) to dsa2.c,
where it really belonged in the first place, but had no companion so it
stood with dsa_port_notify.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:03 +0000 (03:00 +0200)]
net: mscc: ocelot: don't use NPI tag prefix for the CPU port module
Context: Ocelot switches put the injection/extraction frame header in
front of the Ethernet header. When used in NPI mode, a DSA master would
see junk instead of the destination MAC address, and it would most
likely drop the packets. So the Ocelot frame header can have an optional
prefix, which is just "ff:ff:ff:ff:ff:fe > ff:ff:ff:ff:ff:ff" padding
put before the actual tag (still before the real Ethernet header) such
that the DSA master thinks it's looking at a broadcast frame with a
strange EtherType.
Unfortunately, a lesson learned in commit f48a79338b4f ("net: mscc:
ocelot: eliminate confusion between CPU and NPI port") seems to have
been forgotten in the meanwhile.
The CPU port module and the NPI port have independent settings for the
length of the tag prefix. However, the driver is using the same variable
to program both of them.
There is no reason really to use any tag prefix with the CPU port
module, since that is not connected to any Ethernet port. So this patch
makes the inj_prefix and xtr_prefix variables apply only to the NPI
port (which the switchdev ocelot_vsc7514 driver does not use).
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:02 +0000 (03:00 +0200)]
net: mscc: ocelot: reapply bridge forwarding mask on bonding join/leave
Applying the bridge forwarding mask currently is done only on the STP
state changes for any port. But it depends on both STP state changes,
and bonding interface state changes. Export the bit that recalculates
the forwarding mask so that it could be reused, and call it when a port
starts and stops offloading a bonding interface.
Now that the logic is split into a separate function, we can rename "p"
into "port", since the "port" variable was already taken in
ocelot_bridge_stp_state_set. Also, we can rename "i" into "lag", to make
it more clear what is it that we're iterating through.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:01 +0000 (03:00 +0200)]
net: mscc: ocelot: store a namespaced VCAP filter ID
We will be adding some private VCAP filters that should not interfere in
any way with the filters added using tc-flower. So we need to allocate
some IDs which will not be used by tc.
Currently ocelot uses an u32 id derived from the flow cookie, which in
itself is an unsigned long. This is a problem in itself, since on 64 bit
systems, sizeof(unsigned long)=8, so the driver is already truncating
these.
Create a struct ocelot_vcap_id which contains the full unsigned long
cookie from tc, as well as a boolean that is supposed to namespace the
filters added by tc with the ones that aren't.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 01:00:00 +0000 (03:00 +0200)]
net: mscc: ocelot: export VCAP structures to include/soc/mscc
The Felix driver will need to preinstall some VCAP filters for its
tag_8021q implementation (outside of the tc-flower offload logic), so
these need to be exported to the common includes.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Vladimir Oltean [Fri, 29 Jan 2021 00:59:59 +0000 (02:59 +0200)]
net: dsa: tag_8021q: add helpers to deduce whether a VLAN ID is RX or TX VLAN
The sja1105 implementation can be blind about this, but the felix driver
doesn't do exactly what it's being told, so it needs to know whether it
is a TX or an RX VLAN, so it can install the appropriate type of TCAM
rule.
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Heiner Kallweit [Thu, 28 Jan 2021 22:01:54 +0000 (23:01 +0100)]
r8169: work around RTL8125 UDP hw bug
It was reported that on RTL8125 network breaks under heavy UDP load,
e.g. torrent traffic ([0], from comment 27). Realtek confirmed a hw bug
and provided me with a test version of the r8125 driver including a
workaround. Tests confirmed that the workaround fixes the issue.
I modified the original version of the workaround to meet mainline
code style.
v2:
- rebased to net
v3:
- make rtl_skb_is_udp() more robust and use skb_header_pointer()
to access the ip(v6) header
v4:
- remove dependency on ptp_classify.h
- replace magic number with offsetof(struct udphdr, len)
Fixes: 80025be3880f ("r8169: add support for RTL8125") Tested-by: xplo <xplo.bn@gmail.com> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Acked-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/6e453d49-1801-e6de-d5f7-d7e6c7526c8f@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ronak Doshi [Thu, 28 Jan 2021 02:08:16 +0000 (18:08 -0800)]
vmxnet3: Remove buf_info from device accessible structures
buf_info structures in RX & TX queues are private driver data that
do not need to be visible to the device. Although there is physical
address and length in the queue descriptor that points to these
structures, their layout is not standardized, and device never looks
at them.
So lets allocate these structures in non-DMA-able memory, and fill
physical address as all-ones and length as zero in the queue
descriptor.
That should alleviate worries brought by Martin Radev in
https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20210104/022829.html
that malicious vmxnet3 device could subvert SVM/TDX guarantees.
Signed-off-by: Petr Vandrovec <petr@vmware.com> Signed-off-by: Ronak Doshi <doshir@vmware.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
net: Remove redundant calls of sk_tx_queue_clear().
The commit 2752c24eeec5 ("net: Do not clear the sock TX queue in
sk_set_socket()") removes sk_tx_queue_clear() from sk_set_socket() and adds
it instead in sk_alloc() and sk_clone_lock() to fix an issue introduced in
the commit 20e5169d94b8 ("net: Introduce sk_tx_queue_mapping"). On the
other hand, the original commit had already put sk_tx_queue_clear() in
sk_prot_alloc(): the callee of sk_alloc() and sk_clone_lock(). Thus
sk_tx_queue_clear() is called twice in each path.
If we remove sk_tx_queue_clear() in sk_alloc() and sk_clone_lock(), it
currently works well because (i) sk_tx_queue_mapping is defined between
sk_dontcopy_begin and sk_dontcopy_end, and (ii) sock_copy() called after
sk_prot_alloc() in sk_clone_lock() does not overwrite sk_tx_queue_mapping.
However, if we move sk_tx_queue_mapping out of the no copy area, it
introduces a bug unintentionally.
Therefore, this patch adds a compile-time check to take care of the order
of sock_copy() and sk_tx_queue_clear() and removes sk_tx_queue_clear() from
sk_prot_alloc() so that it does the only allocation and its callers
initialize fields.
Ahmed S. Darwish [Thu, 28 Jan 2021 19:48:02 +0000 (20:48 +0100)]
net: arcnet: Fix RESET flag handling
The main arcnet interrupt handler calls arcnet_close() then
arcnet_open(), if the RESET status flag is encountered.
This is invalid:
1) In general, interrupt handlers should never call ->ndo_stop() and
->ndo_open() functions. They are usually full of blocking calls and
other methods that are expected to be called only from drivers
init and exit code paths.
2) arcnet_close() contains a del_timer_sync(). If the irq handler
interrupts the to-be-deleted timer, del_timer_sync() will just loop
forever.
3) arcnet_close() also calls tasklet_kill(), which has a warning if
called from irq context.
4) For device reset, the sequence "arcnet_close(); arcnet_open();" is
not complete. Some children arcnet drivers have special init/exit
code sequences, which then embed a call to arcnet_open() and
arcnet_close() accordingly. Check drivers/net/arcnet/com20020.c.
Run the device RESET sequence from a scheduled workqueue instead.
Jakub Kicinski [Sat, 30 Jan 2021 04:42:25 +0000 (20:42 -0800)]
Merge branch 'net-hns3-updates-for-next'
Huazhong Tan says:
====================
net: hns3: updates for -next
This patchset adds dump tm info of nodes, priority and qset in debugfs.
Three debugfs files tm_nodes, tm_priority and tm_qset are created in
new tm directory, and use cat command to dump their info, for examples:
Guangbin Huang [Thu, 28 Jan 2021 11:51:36 +0000 (19:51 +0800)]
net: hns3: add debugfs support for tm nodes, priority and qset info
In order to query tm info of nodes, priority and qset
for debugging, adds three debugfs files tm_nodes,
tm_priority and tm_qset in newly created tm directory.
Unlike previous debugfs commands, these three files
just support read ops, so they only support to use cat
command to dump their info.
The new tm file style is acccording to suggestion from
Jakub Kicinski's opinion as link https://lkml.org/lkml/2020/9/29/2101.
Signed-off-by: Guangbin Huang <huangguangbin2@huawei.com> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
====================
net: add support for ip generic checksum offload for gre
This patchset it to add ip generic csum processing first in
skb_csum_hwoffload_help() in Patch 1/2 and then add csum
offload support for GRE header in Patch 2/2.
====================
Xin Long [Thu, 28 Jan 2021 09:18:32 +0000 (17:18 +0800)]
ip_gre: add csum offload support for gre header
This patch is to add csum offload support for gre header:
On the TX path in gre_build_header(), when CHECKSUM_PARTIAL's set
for inner proto, it will calculate the csum for outer proto, and
inner csum will be offloaded later. Otherwise, CHECKSUM_PARTIAL
and csum_start/offset will be set for outer proto, and the outer
csum will be offloaded later.
On the GSO path in gre_gso_segment(), when CHECKSUM_PARTIAL is
not set for inner proto and the hardware supports csum offload,
CHECKSUM_PARTIAL and csum_start/offset will be set for outer
proto, and outer csum will be offloaded later. Otherwise, it
will do csum for outer proto by calling gso_make_checksum().
Note that SCTP has to do the csum by itself for non GSO path in
sctp_packet_pack(), as gre_build_header() can't handle the csum
with CHECKSUM_PARTIAL set for SCTP CRC csum offload.
v1->v2:
- remove the SCTP part, as GRE dev doesn't support SCTP CRC CSUM
and it will always do checksum for SCTP in sctp_packet_pack()
when it's not a GSO packet.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Xin Long [Thu, 28 Jan 2021 09:18:31 +0000 (17:18 +0800)]
net: support ip generic csum processing in skb_csum_hwoffload_help
NETIF_F_IP|IPV6_CSUM feature flag indicates UDP and TCP csum offload
while NETIF_F_HW_CSUM feature flag indicates ip generic csum offload
for HW, which includes not only for TCP/UDP csum, but also for other
protocols' csum like GRE's.
However, in skb_csum_hwoffload_help() it only checks features against
NETIF_F_CSUM_MASK(NETIF_F_HW|IP|IPV6_CSUM). So if it's a non TCP/UDP
packet and the features doesn't support NETIF_F_HW_CSUM, but supports
NETIF_F_IP|IPV6_CSUM only, it would still return 0 and leave the HW
to do csum.
This patch is to support ip generic csum processing by checking
NETIF_F_HW_CSUM for all protocols, and check (NETIF_F_IP_CSUM |
NETIF_F_IPV6_CSUM) only for TCP and UDP.
Note that we're using skb->csum_offset to check if it's a TCP/UDP
proctol, this might be fragile. However, as Alex said, for now we
only have a few L4 protocols that are requesting Tx csum offload,
we'd better fix this until a new protocol comes with a same csum
offset.
v1->v2:
- not extend skb->csum_not_inet, but use skb->csum_offset to tell
if it's an UDP/TCP csum packet.
v2->v3:
- add a note in the changelog, as Willem suggested.
Suggested-by: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
All patches are by me and target the mcp251xfd driver. The first 4
patches update the information regarding the "85% of (FSYSCLK/2)"
errata. The other 4 are misc cleanups, unitfy error messages, add
missing postfix to a macro, simplify the return of a function, and
make use of dev_err_probe() in the mcp251xfd_probe() function.
====================