Arnd Bergmann [Fri, 11 Dec 2020 21:36:38 +0000 (13:36 -0800)]
kbuild: avoid static_assert for genksyms
genksyms does not know or care about the _Static_assert() built-in, and
sometimes falls back to ignoring the later symbols, which causes
undefined behavior such as
WARNING: modpost: EXPORT symbol "ethtool_set_ethtool_phy_ops" [vmlinux] version generation failed, symbol will not be versioned.
ld: net/ethtool/common.o: relocation R_AARCH64_ABS32 against `__crc_ethtool_set_ethtool_phy_ops' can not be used when making a shared object
net/ethtool/common.o:(_ftrace_annotated_branch+0x0): dangerous relocation: unsupported relocation
Redefine static_assert for genksyms to avoid that.
Link: https://lkml.kernel.org/r/20201203230955.1482058-1-arnd@kernel.org Signed-off-by: Arnd Bergmann <arnd@arndb.de> Suggested-by: Ard Biesheuvel <ardb@kernel.org> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Michal Marek <michal.lkml@markovi.net> Cc: Kees Cook <keescook@chromium.org> Cc: Rikard Falkeborn <rikard.falkeborn@gmail.com> Cc: Marco Elver <elver@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Chen [Fri, 11 Dec 2020 21:36:31 +0000 (13:36 -0800)]
proc: use untagged_addr() for pagemap_read addresses
When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().
I tested with 5.10-rc4 and the issue remains.
Explanation from Catalin in [1]:
"Arguably, that's a user-space bug since tagged file offsets were never
supported. In this case it's not even a tag at bit 56 as per the arm64
tagged address ABI but rather down to bit 47. You could say that the
problem is caused by the C library (malloc()) or whoever created the
tagged vaddr and passed it to this function. It's not a kernel
regression as we've never supported it.
Now, pagemap is a special case where the offset is usually not
generated as a classic file offset but rather derived by shifting a
user virtual address. I guess we can make a concession for pagemap
(only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"
My test code is based on [2]:
A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
userspace program:
uint64 OsLayer::VirtualToPhysical(void *vaddr) {
uint64 frame, paddr, pfnmask, pagemask;
int pagesize = sysconf(_SC_PAGESIZE);
off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
int fd = open(kPagemapPath, O_RDONLY);
...
if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
int err = errno;
string errtxt = ErrorString(err);
if (fd >= 0)
close(fd);
return 0;
}
...
}
/* watch out for wraparound */
// svpfn == 0xb400007662f54
// (mm->task_size >> PAGE) == 0x8000000
if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
start_vaddr = end_vaddr;
ret = 0;
while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
int len;
unsigned long end;
...
}
...
}
Andrew Morton [Fri, 11 Dec 2020 21:36:27 +0000 (13:36 -0800)]
revert "mm/filemap: add static for function __add_to_page_cache_locked"
Revert commit e6c64253e614 ("mm/filemap: add static for function
__add_to_page_cache_locked") due to incompatibility with
ALLOW_ERROR_INJECTION which result in build errors.
Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com Tested-by: Justin Forbes <jmforbes@linuxtx.org> Tested-by: Greg Thelen <gthelen@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Cc: Michal Kubecek <mkubecek@suse.cz> Cc: Alex Shi <alex.shi@linux.alibaba.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Tony Luck <tony.luck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Dec 2020 00:36:30 +0000 (16:36 -0800)]
Merge tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One commit to implement copy_from_kernel_nofault_allowed(), otherwise
copy_from_kernel_nofault() can trigger warnings when accessing bad
addresses in some configurations.
Thanks to Christophe Leroy and Qian Cai"
* tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix KUAP warning by providing copy_from_kernel_nofault_allowed()
Linus Torvalds [Thu, 10 Dec 2020 23:36:09 +0000 (15:36 -0800)]
Merge tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Here are a handful more bugfixes for 5.10.
Unfortunately, we found some problems with the new READ_PLUS operation
that aren't easy to fix. We've decided to disable this codepath
through a Kconfig option for now, but a series of patches going into
5.11 will clean up the code and fix the issues at the same time. This
seemed like the best way to go about it.
Summary:
- Fix array overflow when flexfiles mirroring is enabled
- Fix rpcrdma_inline_fixup() crash with new LISTXATTRS
- Fix 5 second delay when doing inter-server copy
- Disable READ_PLUS by default"
* tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Disable READ_PLUS by default
NFSv4.2: Fix 5 seconds delay when doing inter server copy
NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.
3) Fix polling in xsk sockets by using sk_poll_wait() instead of
datagram_poll() which keys off of sk_wmem_alloc and such which xsk
sockets do not update. From Xuan Zhuo.
4) Missing init of rekey_data in cfgh80211, from Sara Sharon.
5) Fix destroy of timer before init, from Davide Caratti.
6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
Bergmann.
7) Missing error return in rtm_to_fib_config() switch case, from Zhang
Changzhong.
8) Fix some src/dest address handling in vrf and add a testcase. From
Stephen Suryaputra.
9) Fix multicast handling in Seville switches driven by mscc-ocelot
driver. From Vladimir Oltean.
10) Fix proto value passed to skb delivery demux in udp, from Xin Long.
11) HW pkt counters not reported correctly in enetc driver, from Claudiu
Manoil.
12) Fix deadlock in bridge, from Joseph Huang.
13) Missing of_node_pur() in dpaa2 driver, fromn Christophe JAILLET.
14) Fix pid fetching in bpftool when there are a lot of results, from
Andrii Nakryiko.
15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.
16) Various stymmac fixes, from Fugang Duan.
17) Fix null deref in tipc, from Cengiz Can.
18) When mss is biog, coose more resonable rcvq_space in tcp, fromn Eric
Dumazet.
19) Revert a geneve change that likely isnt necessary, from Jakub
Kicinski.
20) Avoid premature rx buffer reuse in various Intel driversm from Björn
Töpel.
21) retain EcT bits during TIS reflection in tcp, from Wei Wang.
22) Fix Tso deferral wrt. cwnd limiting in tcp, from Neal Cardwell.
23) MPLS_OPT_LSE_LABEL attribute is 342 ot 8 bits, from Guillaume Nault
24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
cases, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
selftests: fix poll error in udpgro.sh
selftests/bpf: Fix "dubious pointer arithmetic" test
selftests/bpf: Fix array access with signed variable test
selftests/bpf: Add test for signed 32-bit bound check bug
bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
net/mlx4_en: Handle TX error CQE
net/mlx4_en: Avoid scheduling restart task if it is already running
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
net: flow_offload: Fix memory leak for indirect flow block
tcp: Retain ECT bits for tos reflection
ethtool: fix stack overflow in ethnl_parse_bitset()
e1000e: fix S0ix flow to allow S0i3.2 subset entry
ice: avoid premature Rx buffer reuse
ixgbe: avoid premature Rx buffer reuse
i40e: avoid premature Rx buffer reuse
igb: avoid transmit queue timeout in xdp path
igb: use xdp_do_flush
igb: skb add metasize for xdp
...
Anna Schumaker [Thu, 3 Dec 2020 20:18:39 +0000 (15:18 -0500)]
NFS: Disable READ_PLUS by default
We've been seeing failures with xfstests generic/091 and generic/263
when using READ_PLUS. I've made some progress on these issues, and the
tests fail later on but still don't pass. Let's disable READ_PLUS by
default until we can work out what is going on.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Dai Ngo [Tue, 24 Nov 2020 03:15:17 +0000 (22:15 -0500)]
NFSv4.2: Fix 5 seconds delay when doing inter server copy
Since commit ea6614d1b9795 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.
Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is part of a 2 patches, the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.
Fixes: 4026a28991fd ("NFSD add nfs4 inter ssc to nfsd4_copy") Signed-off-by: Dai Ngo <dai.ngo@oracle.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Wed, 25 Nov 2020 00:15:18 +0000 (19:15 -0500)]
NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.
For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.
- When ->send_request() is called, rpcrdma_marshal_req() does not
set up a Reply chunk because buflen is smaller than the inline
threshold. Thus rpcrdma_convert_iovs() does not get invoked at
all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
on the receive buffer.
- During reply processing, rpcrdma_inline_fixup() tries to copy
received data into rq_rcv_buf->pages because page_len is positive.
But there are no receive pages because rpcrdma_marshal_req() never
allocated them.
The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.
RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.
Reported-by: Olga kornievskaia <kolga@netapp.com> Suggested-by: Olga kornievskaia <kolga@netapp.com> Fixes: c426fc53045e ("NFSv4.2: add the extended attribute proc functions.") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Reviewed-by: Olga kornievskaia <kolga@netapp.com> Reviewed-by: Frank van der Linden <fllinden@amazon.com> Tested-by: Olga kornievskaia <kolga@netapp.com> Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Paolo Abeni [Wed, 9 Dec 2020 11:21:13 +0000 (12:21 +0100)]
selftests: fix poll error in udpgro.sh
The test program udpgso_bench_rx always invokes the poll()
syscall with a timeout of 10ms. If a larger timeout is specified
via the command line, udpgso_bench_rx is supposed to do multiple
poll() calls till the timeout is expired or an event is received.
Currently the poll() loop errors out after the first invocation with
no events, and may causes self-tests failure alike:
This change addresses the issue allowing the poll() loop to consume
all the configured timeout.
Fixes: 63269f2c3584 ("selftests: fixes for UDP GRO") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
selftests/bpf: Fix "dubious pointer arithmetic" test
The verifier trace changed following a bugfix. After checking the 64-bit
sign, only the upper bit mask is known, not bit 31. Update the test
accordingly.
selftests/bpf: Add test for signed 32-bit bound check bug
After a 32-bit load followed by a branch, the verifier would reduce the
maximum bound of the register to 0x7fffffff, allowing a user to bypass
bound checks. Ensure such a program is rejected.
In the second test, the 64-bit compare should not sufficient to
determine whether the signed 32-bit lower bound is 0, so the verifier
should reject the second branch.
bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
The 64-bit signed bounds should not affect 32-bit signed bounds unless the
verifier knows that upper 32-bits are either all 1s or all 0s. For example the
register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
since smax_value could be larger than 32-bit subregister can hold.
The verifier refines the smax/s32_max return value from certain helpers in
do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min
value is also bounded. When both smin and smax bounds fit into 32-bit
subregister the verifier can propagate those bounds.
Linus Torvalds [Thu, 10 Dec 2020 19:00:27 +0000 (11:00 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Two user triggerable crashers and a some EFA related regressions:
- Syzkaller found a bug in CM
- Restore access to the GID table and fix modify_qp for EFA
- Crasher in qedr"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait
RDMA/core: Fix empty gid table for non IB/RoCE devices
RDMA/efa: Use the correct current and new states in modify QP
RDMA/qedr: iWARP invalid(zero) doorbell address fix
Linus Torvalds [Thu, 10 Dec 2020 18:48:49 +0000 (10:48 -0800)]
Merge tag 'media/v5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A couple of fixes:
- videobuf2: fix a DMABUF bug, preventing it to properly handle cache
sync/flush
- vidtv: an usage after free and a few sparse/smatch warning fixes
- pulse8-cec: a duplicate free and a bug related to new firmware
usage
- mtk-cir: fix a regression on a clock setting"
* tag 'media/v5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: vidtv: fix some warnings
media: vidtv: fix kernel-doc markups
media: [next] media: vidtv: fix a read from an object after it has been freed
media: vb2: set cache sync hints when init buffers
media: pulse8-cec: add support for FW v10 and up
media: pulse8-cec: fix duplicate free at disconnect or probe error
media: mtk-cir: fix calculation of chk period
Fixes: eecd9dabea43 ("cls_flower: Support filtering on multiple MPLS Label Stack Entries") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Dec 2020 02:48:29 +0000 (18:48 -0800)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2020-12-09
This series contains updates to igb, ixgbe, i40e, and ice drivers.
Sven Auhagen fixes issues with igb XDP: return correct error value in XDP
xmit back, increase header padding to include space for double VLAN, add
an extack error when Rx buffer is too small for frame size, set metasize if
it is set in xdp, change xdp_do_flush_map to xdp_do_flush, and update
trans_start to avoid possible Tx timeout.
Björn fixes an issue where an Rx buffer can be reused prematurely with
XDP redirect for ixgbe, i40e, and ice drivers.
The following are changes since commit 21d3296b2f106b381b7e5d50b57abe992247bc17:
can: isotp: isotp_setsockopt(): block setsockopt on bound sockets
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 1GbE
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Dec 2020 00:44:35 +0000 (16:44 -0800)]
Merge branch 'mlx4_en-fixes'
Tariq Toukan says:
====================
mlx4_en fixes
This patchset by Moshe contains fixes to the mlx4 Eth driver,
addressing issues in restart flow.
Patch 1 protects the restart task from being rescheduled while active.
Please queue for -stable >= v2.6.
Patch 2 reconstructs SQs stuck in error state, and adds prints for improved
debuggability.
Please queue for -stable >= v3.12.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 9 Dec 2020 13:03:39 +0000 (15:03 +0200)]
net/mlx4_en: Handle TX error CQE
In case error CQE was found while polling TX CQ, the QP is in error
state and all posted WQEs will generate error CQEs without any data
transmitted. Fix it by reopening the channels, via same method used for
TX timeout handling.
In addition add some more info on error CQE and WQE for debug.
Fixes: e524f9134ebc ("net/mlx4_en: Notify user when TX ring in error state") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 9 Dec 2020 13:03:38 +0000 (15:03 +0200)]
net/mlx4_en: Avoid scheduling restart task if it is already running
Add restarting state flag to avoid scheduling another restart task while
such task is already running. Change task name from watchdog_task to
restart_task to better fit the task role.
Fixes: b6514a16e725 ("mlx4_en: Fix a race at restart task") Signed-off-by: Moshe Shemesh <moshe@mellanox.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Neal Cardwell [Wed, 9 Dec 2020 03:57:59 +0000 (22:57 -0500)]
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:
(1) ACK for full-sized skb of N*MSS arrives
-> tcp_write_xmit() transmit full-sized skb with N*MSS
-> move pacing release time forward
-> exit tcp_write_xmit() because pacing time is in the future
(2) TSQ callback or TCP internal pacing timer fires
-> try to transmit next skb, but TSO deferral finds remainder of
available cwnd is not big enough to trigger an immediate send
now, so we defer sending until the next ACK.
(3) repeat...
So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:
o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.
o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.
Fixes: fe6cff23010c ("tcp: make cwnd-limited checks measurement-based, and gentler") Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Chris Mi [Tue, 8 Dec 2020 02:48:35 +0000 (10:48 +0800)]
net: flow_offload: Fix memory leak for indirect flow block
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. But it returns on the first iteration of the for loop.
The remaining indirect flow blocks will never be freed.
Fixes: 1369a329b0de ("net: flow_offload: consolidate indirect flow_block infrastructure") CC: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Chris Mi <cmi@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com>
Wei Wang [Tue, 8 Dec 2020 17:55:08 +0000 (09:55 -0800)]
tcp: Retain ECT bits for tos reflection
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting syn TOS in syn-ack, in order to
make ECN work properly.
Fixes: d339f69f36db ("tcp: reflect tos value received in SYN to the socket") Reported-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Wei Wang <weiwan@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Tue, 8 Dec 2020 22:13:51 +0000 (23:13 +0100)]
ethtool: fix stack overflow in ethnl_parse_bitset()
Syzbot reported a stack overflow in bitmap_from_arr32() called from
ethnl_parse_bitset() when bitset from netlink message is longer than
target bitmap length. While ethnl_compact_sanity_checks() makes sure that
trailing part is all zeros (i.e. the request does not try to touch bits
kernel does not recognize), we also need to cap change_bits to nbits so
that we don't try to write past the prepared bitmaps.
Björn Töpel [Tue, 25 Aug 2020 17:27:36 +0000 (19:27 +0200)]
ice: avoid premature Rx buffer reuse
The page recycle code, incorrectly, relied on that a page fragment
could not be freed inside xdp_do_redirect(). This assumption leads to
that page fragments that are used by the stack/XDP redirect can be
reused and overwritten.
To avoid this, store the page count prior invoking xdp_do_redirect().
Fixes: c5c080d54985 ("ice: Add support for XDP") Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Björn Töpel [Tue, 25 Aug 2020 17:27:35 +0000 (19:27 +0200)]
ixgbe: avoid premature Rx buffer reuse
The page recycle code, incorrectly, relied on that a page fragment
could not be freed inside xdp_do_redirect(). This assumption leads to
that page fragments that are used by the stack/XDP redirect can be
reused and overwritten.
To avoid this, store the page count prior invoking xdp_do_redirect().
Fixes: 57efc450efeb ("ixgbe: add initial support for xdp redirect") Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Björn Töpel [Tue, 25 Aug 2020 17:27:34 +0000 (19:27 +0200)]
i40e: avoid premature Rx buffer reuse
The page recycle code, incorrectly, relied on that a page fragment
could not be freed inside xdp_do_redirect(). This assumption leads to
that page fragments that are used by the stack/XDP redirect can be
reused and overwritten.
To avoid this, store the page count prior invoking xdp_do_redirect().
Longer explanation:
Intel NICs have a recycle mechanism. The main idea is that a page is
split into two parts. One part is owned by the driver, one part might
be owned by someone else, such as the stack.
t0: Page is allocated, and put on the Rx ring
+---------------
used by NIC ->| upper buffer
(rx_buffer) +---------------
| lower buffer
+---------------
page count == USHRT_MAX
rx_buffer->pagecnt_bias == USHRT_MAX
t1: Buffer is received, and passed to the stack (e.g.)
+---------------
| upper buff (skb)
+---------------
used by NIC ->| lower buffer
(rx_buffer) +---------------
page count == USHRT_MAX
rx_buffer->pagecnt_bias == USHRT_MAX - 1
t2: Buffer is received, and redirected
+---------------
| upper buff (skb)
+---------------
used by NIC ->| lower buffer
(rx_buffer) +---------------
This means that buffer *cannot* be flipped/reused, because the skb is
still using it.
The problem arises when xdp_do_redirect() actually frees the
segment. Then we get:
page count == USHRT_MAX - 1
rx_buffer->pagecnt_bias == USHRT_MAX - 2
From a recycle perspective, the buffer can be flipped and reused,
which means that the skb data area is passed to the Rx HW ring!
To work around this, the page count is stored prior calling
xdp_do_redirect().
Note that this is not optimal, since the NIC could actually reuse the
"lower buffer" again. However, then we need to track whether
XDP_REDIRECT consumed the buffer or not.
Fixes: 53e1001a6e32 ("i40e: add support for XDP_REDIRECT") Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com> Signed-off-by: Björn Töpel <bjorn.topel@intel.com> Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sven Auhagen [Wed, 11 Nov 2020 17:04:53 +0000 (18:04 +0100)]
igb: avoid transmit queue timeout in xdp path
Since we share the transmit queue with the network stack,
it is possible that we run into a transmit queue timeout.
This will reset the queue.
This happens under high load when XDP is using the
transmit queue pretty much exclusively.
netdev_start_xmit() sets the trans_start variable of the
transmit queue to jiffies which is later utilized by dev_watchdog(),
so to avoid timeout, let stack know that XDP xmit happened by
bumping the trans_start within XDP Tx routines to jiffies.
Fixes: d0049eb03c6f ("igb: add XDP support") Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de> Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Linus Torvalds [Wed, 9 Dec 2020 22:49:48 +0000 (14:49 -0800)]
Merge tag 'arm-soc-fixes-v5.10-4b' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"There are a few more PHY mode changes for allwinner SoC based boards
with a Realtek PHY after the driver changed its behavior, I assume
there will be more of these in the future. Also on for Allwinner, the
Banana Pi M2 board had a regression that led to some devices not
working because of a slightly incorrect voltage being applied.
By popular demand, I picked up a change from Krzysztof Kozlowski to
actually list the SoC tree in the MAINTAINERS file. We don't want to
get Cc'd on normal patches that are picked up by platform maintainers,
but the lack of an entry has led to confusion in the past.
All the other changes are fairly benign, fixing boot-time or
compile-time warning messages in various places:
- A dtc warning on the OLPC XO-1.75
- A boot-time warning on i.MX6 wandboard
- A harmless compile-time warning
- A regression causing one of the i.MX6 SoCs to be identified as
another
- Missing SoC identification of Allwinner V3 and S3"
* tag 'arm-soc-fixes-v5.10-4b' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
firmware: xilinx: Mark pm_api_features_map with static keyword
ARM: dts: mmp2-olpc-xo-1-75: clear the warnings when make dtbs
MAINTAINERS: add a limited ARM and ARM64 SoC entry
MAINTAINERS: correct SoC Git address (formerly: arm-soc)
ARM: keystone: remove SECTION_SIZE_BITS/MAX_PHYSMEM_BITS
arm64: dts: allwinner: H5: NanoPi Neo Plus2: phy-mode rgmii-id
arm64: dts: allwinner: A64 Sopine: phy-mode rgmii-id
ARM: dts: imx6qdl-kontron-samx6i: fix I2C_PM scl pin
ARM: dts: imx6qdl-wandboard-revd1: Remove PAD_GPIO_6 from enetgrp
ARM: imx: Use correct SRC base address
ARM: dts: sun7i: pcduino3-nano: enable RGMII RX/TX delay on PHY
ARM: dts: sun8i: v3s: fix GIC node memory range
ARM: dts: sun8i: v40: bananapi-m2-berry: Fix ethernet node
ARM: dts: sun8i: r40: bananapi-m2-berry: Fix dcdc1 regulator
ARM: dts: sun7i: bananapi: Enable RGMII RX/TX delay on Ethernet PHY
ARM: dts: s3: pinecube: align compatible property to other S3 boards
ARM: sunxi: Add machine match for the Allwinner V3 SoC
arm64: dts: allwinner: h6: orangepi-one-plus: Fix ethernet
Jakub Kicinski [Wed, 9 Dec 2020 22:39:56 +0000 (14:39 -0800)]
Revert "geneve: pull IP header before ECN decapsulation"
This reverts commit 617aa603f376 ("geneve: pull IP header before ECN decapsulation").
Eric says: "network header should have been pulled already before
hitting geneve_rx()". Let's revert the syzbot fix since it's causing
more harm than good, and revisit.
Zhen Lei [Mon, 7 Dec 2020 08:47:52 +0000 (16:47 +0800)]
ARM: dts: mmp2-olpc-xo-1-75: clear the warnings when make dtbs
The check_spi_bus_bridge() in scripts/dtc/checks.c requires that the node
have "spi-slave" property must with "#address-cells = <0>" and
"#size-cells = <0>". But currently both "#address-cells" and "#size-cells"
properties are deleted, the corresponding default values are 2 and 1. As a
result, the check fails and below warnings is displayed.
arch/arm/boot/dts/mmp2.dtsi:472.23-480.6: Warning (spi_bus_bridge): \
/soc/apb@d4000000/spi@d4037000: incorrect #address-cells for SPI bus
also defined at arch/arm/boot/dts/mmp2-olpc-xo-1-75.dts:225.7-237.3
arch/arm/boot/dts/mmp2.dtsi:472.23-480.6: Warning (spi_bus_bridge): \
/soc/apb@d4000000/spi@d4037000: incorrect #size-cells for SPI bus
also defined at arch/arm/boot/dts/mmp2-olpc-xo-1-75.dts:225.7-237.3
arch/arm/boot/dts/mmp2-olpc-xo-1-75.dtb: Warning (spi_bus_reg): \
Failed prerequisite 'spi_bus_bridge'
Because the value of "#size-cells" is already defined as zero in the node
"ssp3: spi@d4037000" in arch/arm/boot/dts/mmp2.dtsi. So we only need to
explicitly add "#address-cells = <0>" and keep "#size-cells" no change.
Fixes: bd2bd442219d ("[PATCH] IB: Add the kernel CM implementation") Link: https://lore.kernel.org/r/20201204064205.145795-1-leon@kernel.org Reviewed-by: Maor Gottlieb <maorg@nvidia.com> Reported-by: Amit Matityahu <mitm@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Linus Torvalds [Wed, 9 Dec 2020 17:59:14 +0000 (09:59 -0800)]
Merge tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull iommu fix from Will Deacon:
"Fix interrupt table length definition for AMD IOMMU.
It's actually a fix for a fix, where the size of the interrupt
remapping table was increased but a related constant for the
size of the interrupt table was forgotten"
* tag 'iommu-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs
Oliver Hartkopp [Fri, 4 Dec 2020 13:35:07 +0000 (14:35 +0100)]
can: isotp: isotp_setsockopt(): block setsockopt on bound sockets
The isotp socket can be widely configured in its behaviour regarding addressing
types, fill-ups, receive pattern tests and link layer length. Usually all
these settings need to be fixed before bind() and can not be changed
afterwards.
This patch adds a check to enforce the common usage pattern.
Daniel Borkmann [Wed, 9 Dec 2020 15:27:42 +0000 (16:27 +0100)]
Merge branch 'bpf-xdp-offload-fixes'
Toke Høiland-Jørgensen says:
====================
This series restores the test_offload.py selftest to working order. It seems a
number of subtle behavioural changes have crept into various subsystems which
broke test_offload.py in a number of ways. Most of these are fairly benign
changes where small adjustments to the test script seems to be the best fix,
but one is an actual kernel bug that I've observed in the wild caused by a bad
interaction between xdp_attachment_flags_ok() and the rework of XDP program
handling in the core netdev code.
Patch 1 fixes the bug by removing xdp_attachment_flags_ok(), and the reminder of
the patches are adjustments to test_offload.py, including a new feature for
netdevsim to force a BPF verification fail. Please see the individual patches
for details.
Changelog:
v4:
- Accidentally truncated the Fixes: hashes in patches 3/4 to 11 chars
v3:
- Add Fixes: tags
v2:
- Replace xdp_attachment_flags_ok() with a check in dev_xdp_attach()
- Better packing of struct nsim_dev
====================
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
selftests/bpf/test_offload.py: Filter bpftool internal map when counting maps
A few of the tests in test_offload.py expects to see a certain number of
maps created, and checks this by counting the number of maps returned by
bpftool. There is already a filter that will remove any maps already there
at the beginning of the test, but bpftool now creates a map for the PID
iterator rodata on each invocation, which makes the map count wrong. Fix
this by also filtering the pid_iter.rodata map by name when counting.
Fixes: 5a6b4f611c07 ("tools/bpftool: Show info for processes holding BPF map/prog/link/btf FDs") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752226387.110217.9887866138149423444.stgit@toke.dk
selftests/bpf/test_offload.py: Reset ethtool features after failed setting
When setting the ethtool feature flag fails (as expected for the test), the
kernel now tracks that the feature was requested to be 'off' and refuses to
subsequently disable it again. So reset it back to 'on' so a subsequent
disable (that's not supposed to fail) can succeed.
selftests/bpf/test_offload.py: Fix expected case of extack messages
Commit 10dc71d50bb8 ("bpf, xdp: Maintain info on attached XDP BPF programs
in net_device") changed the case of some of the extack messages being
returned when attaching of XDP programs failed. This broke test_offload.py,
so let's fix the test to reflect this.
Fixes: 10dc71d50bb8 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752226175.110217.11214100824416344952.stgit@toke.dk
selftests/bpf/test_offload.py: Only check verifier log on verification fails
Since commit 31af6335dc4d ("bpf: Make verifier log more relevant by
default"), the verifier discards log messages for successfully-verified
programs. This broke test_offload.py which is looking for a verification
message from the driver callback. Change test_offload.py to use the toggle
in netdevsim to make the verification fail before looking for the
verification message.
netdevsim: Add debugfs toggle to reject BPF programs in verifier
This adds a new debugfs toggle ('bpf_bind_verifier_accept') that can be
used to make netdevsim reject BPF programs from being accepted by the
verifier. If this toggle (which defaults to true) is set to false,
nsim_bpf_verify_insn() will return EOPNOTSUPP on the last
instruction (after outputting the 'Hello from netdevsim' verifier message).
This makes it possible to check the verification callback in the driver
from test_offload.py in selftests, since the verifier now clears the
verifier log on a successful load, hiding the message from the driver.
selftests/bpf/test_offload.py: Remove check for program load flags match
Since we just removed the xdp_attachment_flags_ok() callback, also remove
the check for it in test_offload.py, and replace it with a test for the new
ambiguity-avoid check when multiple programs are loaded.
Fixes: 10dc71d50bb8 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752225858.110217.13036901876869496246.stgit@toke.dk
xdp: Remove the xdp_attachment_flags_ok() callback
Since commit 10dc71d50bb8 ("bpf, xdp: Maintain info on attached XDP BPF
programs in net_device"), the XDP program attachment info is now maintained
in the core code. This interacts badly with the xdp_attachment_flags_ok()
check that prevents unloading an XDP program with different load flags than
it was loaded with. In practice, two kinds of failures are seen:
- An XDP program loaded without specifying a mode (and which then ends up
in driver mode) cannot be unloaded if the program mode is specified on
unload.
- The dev_xdp_uninstall() hook always calls the driver callback with the
mode set to the type of the program but an empty flags argument, which
means the flags_ok() check prevents the program from being removed,
leading to bpf prog reference leaks.
The original reason this check was added was to avoid ambiguity when
multiple programs were loaded. With the way the checks are done in the core
now, this is quite simple to enforce in the core code, so let's add a check
there and get rid of the xdp_attachment_flags_ok() callback entirely.
Fixes: 10dc71d50bb8 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device") Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/bpf/160752225751.110217.10267659521308669050.stgit@toke.dk
netfilter: nft_ct: Remove confirmation check for NFT_CT_ID
Since commit 0037db0314d2 ("netfilter: conntrack: Use consistent ct id
hash calculation") the ct id will not change from initialization to
confirmation. Removing the confirmation check allows for things like
adding an element to a 'typeof ct id' set in prerouting upon reception
of the first packet of a new connection, and then being able to
reference that set consistently both before and after the connection
is confirmed.
Fixes: 0037db0314d2 ("netfilter: conntrack: Use consistent ct id hash calculation") Signed-off-by: Brett Mastbergen <brett.mastbergen@gmail.com> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Minchan Kim [Wed, 9 Dec 2020 04:57:18 +0000 (20:57 -0800)]
mm/madvise: remove racy mm ownership check
Jann spotted the security hole due to race of mm ownership check.
If the task is sharing the mm_struct but goes through execve() before
mm_access(), it could skip process_madvise_behavior_valid check. That
makes *any advice hint* to reach into the remote process.
This patch removes the mm ownership check. With it, it will lose the
ability that local process could give *any* advice hint with vector
interface for some reason (e.g., performance). Since there is no
concrete example in upstream yet, it would be better to remove the
abiliity at this moment and need to review when such new advice comes
up.
Eric Dumazet [Tue, 8 Dec 2020 16:21:31 +0000 (08:21 -0800)]
tcp: select sane initial rcvq_space.space for big MSS
Before commit de1c45f393bd ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB")
small tcp_rmem[1] values were overridden by tcp_fixup_rcvbuf() to accommodate various MSS.
This is no longer the case, and Hazem Mohamed Abuelfotoh reported
that DRS would not work for MTU 9000 endpoints receiving regular (1500 bytes) frames.
Root cause is that tcp_init_buffer_space() uses tp->rcv_wnd for upper limit
of rcvq_space.space computation, while it can select later a smaller
value for tp->rcv_ssthresh and tp->window_clamp.
Based on an initial report and patch from Hazem Mohamed Abuelfotoh
https://lore.kernel.org/netdev/20201204180622.14285-1-abuehaze@amazon.com/
Fixes: de1c45f393bd ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Fixes: 8ecaef87a7bc ("tcp: start receiver buffer autotuning sooner") Reported-by: Hazem Mohamed Abuelfotoh <abuehaze@amazon.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: ll_temac: Fix potential NULL dereference in temac_probe()
platform_get_resource() may fail and in this case a NULL dereference
will occur.
Fix it to use devm_platform_ioremap_resource() instead of calling
platform_get_resource() and devm_ioremap().
This is detected by Coccinelle semantic patch.
@@
expression pdev, res, n, t, e, e1, e2;
@@
res = \(platform_get_resource\|platform_get_resource_byname\)(pdev, t, n);
+ if (!res)
+ return -EINVAL;
... when != res == NULL
e = devm_ioremap(e1, res->start, e2);
Fixes: 89af02da291a ("net: ll_temac: Extend support to non-device-tree platforms") Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com> Acked-by: Esben Haabendal <esben@geanix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Howells [Tue, 8 Dec 2020 23:52:03 +0000 (23:52 +0000)]
afs: Fix memory leak when mounting with multiple source parameters
There's a memory leak in afs_parse_source() whereby multiple source=
parameters overwrite fc->source in the fs_context struct without freeing
the previously recorded source.
Fix this by only permitting a single source parameter and rejecting with
an error all subsequent ones.
This was caught by syzbot with the kernel memory leak detector, showing
something like the following trace:
It turns out that it causes long boot-time latencies (to the point of
timeouts and failed boots).
The cause is the increase in request queues, and a fix for that is
queued up for 5.11, but we're reverting this commit that triggered the
problem for now.
Fugang Duan [Mon, 7 Dec 2020 10:51:41 +0000 (18:51 +0800)]
net: stmmac: overwrite the dma_cap.addr64 according to HW design
The current IP register MAC_HW_Feature1[ADDR64] only defines
32/40/64 bit width, but some SOCs support others like i.MX8MP
support 34 bits but it maps to 40 bits width in MAC_HW_Feature1[ADDR64].
So overwrite dma_cap.addr64 according to HW real design.
Fixes: a9b664a71af4 ("net: ethernet: dwmac: add ethernet glue logic for NXP imx8 chip") Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fugang Duan [Mon, 7 Dec 2020 10:51:40 +0000 (18:51 +0800)]
net: stmmac: delete the eee_ctrl_timer after napi disabled
There have chance to re-enable the eee_ctrl_timer and fire the timer
in napi callback after delete the timer in .stmmac_release(), which
introduces to access eee registers in the timer function after clocks
are disabled then causes system hang. Found this issue when do
suspend/resume and reboot stress test.
It is safe to delete the timer after napi disabled and disable lpi mode.
Fixes: 6d1634718fa8d ("stmmac: add the Energy Efficient Ethernet support") Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fugang Duan [Mon, 7 Dec 2020 10:51:39 +0000 (18:51 +0800)]
net: stmmac: free tx skb buffer in stmmac_resume()
When do suspend/resume test, there have WARN_ON() log dump from
stmmac_xmit() funciton, the code logic:
entry = tx_q->cur_tx;
first_entry = entry;
WARN_ON(tx_q->tx_skbuff[first_entry]);
In normal case, tx_q->tx_skbuff[txq->cur_tx] should be NULL because
the skb should be handled and freed in stmmac_tx_clean().
But stmmac_resume() reset queue parameters like below, skb buffers
may not be freed.
tx_q->cur_tx = 0;
tx_q->dirty_tx = 0;
So free tx skb buffer in stmmac_resume() to avoid warning and
memory leak.
Fixes: a271220f65991 ("net: add support for STMicroelectronics Ethernet controllers.") Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fugang Duan [Mon, 7 Dec 2020 10:51:38 +0000 (18:51 +0800)]
net: stmmac: start phylink instance before stmmac_hw_setup()
Start phylink instance and resume back the PHY to supply
RX clock to MAC before MAC layer initialization by calling
.stmmac_hw_setup(), since DMA reset depends on the RX clock,
otherwise DMA reset cost maximum timeout value then finally
timeout.
Fixes: 7371d799f465 ("net: stmmac: Convert to phylink and remove phylib logic") Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fugang Duan [Mon, 7 Dec 2020 10:51:37 +0000 (18:51 +0800)]
net: stmmac: increase the timeout for dma reset
Current timeout value is not enough for gmac5 dma reset
on imx8mp platform, increase the timeout range.
Signed-off-by: Fugang Duan <fugang.duan@nxp.com> Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
netfilter: nftables: comment indirect serialization of commit_mutex with rtnl_mutex
Add an explicit comment in the code to describe the indirect
serialization of the holders of the commit_mutex with the rtnl_mutex.
Commit ca09846daf2e ("netfilter: nf_tables: do not hold reference on
netdevice from preparation phase") already describes this, but a comment
in this case is better for reference.
Reported-by: Vladimir Oltean <olteanv@gmail.com> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nft_dynset: fix timeouts later than 23 days
Use nf_msecs_to_jiffies64 and nf_jiffies64_to_msecs as provided by 39c838ffcf9a ("netfilter: nf_tables: support timeouts larger than 23
days"), otherwise ruleset listing breaks.
Fixes: ba373cecf896 ("netfilter: nft_dynset: fix element timeout for HZ != 1000") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jarod Wilson [Sat, 5 Dec 2020 17:22:29 +0000 (12:22 -0500)]
bonding: fix feature flag setting at init time
Don't try to adjust XFRM support flags if the bond device isn't yet
registered. Bad things can currently happen when netdev_change_features()
is called without having wanted_features fully filled in yet. This code
runs both on post-module-load mode changes, as well as at module init
time, and when run at module init time, it is before register_netdevice()
has been called and filled in wanted_features. The empty wanted_features
led to features also getting emptied out, which was definitely not the
intended behavior, so prevent that from happening.
Originally, I'd hoped to stop adjusting wanted_features at all in the
bonding driver, as it's documented as being something only the network
core should touch, but we actually do need to do this to properly update
both the features and wanted_features fields when changing the bond type,
or we get to a situation where ethtool sees:
esp-hw-offload: off [requested on]
I do think we should be using netdev_update_features instead of
netdev_change_features here though, so we only send notifiers when the
features actually changed.
Fixes: 4a36c3eccd30 ("bonding: allow xfrm offload setup post-module-load") Reported-by: Ivan Vecera <ivecera@redhat.com> Suggested-by: Ivan Vecera <ivecera@redhat.com> Cc: Jay Vosburgh <j.vosburgh@gmail.com> Cc: Veaceslav Falico <vfalico@gmail.com> Cc: Andy Gospodarek <andy@greyhouse.net> Signed-off-by: Jarod Wilson <jarod@redhat.com> Link: https://lore.kernel.org/r/20201205172229.576587-1-jarod@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Andrii Nakryiko [Fri, 4 Dec 2020 23:20:01 +0000 (15:20 -0800)]
tools/bpftool: Fix PID fetching with a lot of results
In case of having so many PID results that they don't fit into a singe page
(4096) bytes, bpftool will erroneously conclude that it got corrupted data due
to 4096 not being a multiple of struct pid_iter_entry, so the last entry will
be partially truncated. Fix this by sizing the buffer to fit exactly N entries
with no truncation in the middle of record.
Fixes: 5a6b4f611c07 ("tools/bpftool: Show info for processes holding BPF map/prog/link/btf FDs") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20201204232002.3589803-1-andrii@kernel.org
netfilter: x_tables: Switch synchronization to RCU
When running concurrent iptables rules replacement with data, the per CPU
sequence count is checked after the assignment of the new information.
The sequence count is used to synchronize with the packet path without the
use of any explicit locking. If there are any packets in the packet path using
the table information, the sequence count is incremented to an odd value and
is incremented to an even after the packet process completion.
The new table value assignment is followed by a write memory barrier so every
CPU should see the latest value. If the packet path has started with the old
table information, the sequence counter will be odd and the iptables
replacement will wait till the sequence count is even prior to freeing the
old table info.
However, this assumes that the new table information assignment and the memory
barrier is actually executed prior to the counter check in the replacement
thread. If CPU decides to execute the assignment later as there is no user of
the table information prior to the sequence check, the packet path in another
CPU may use the old table information. The replacement thread would then free
the table information under it leading to a use after free in the packet
processing context-
Unable to handle kernel NULL pointer dereference at virtual
address 000000000000008e
pc : ip6t_do_table+0x5d0/0x89c
lr : ip6t_do_table+0x5b8/0x89c
ip6t_do_table+0x5d0/0x89c
ip6table_filter_hook+0x24/0x30
nf_hook_slow+0x84/0x120
ip6_input+0x74/0xe0
ip6_rcv_finish+0x7c/0x128
ipv6_rcv+0xac/0xe4
__netif_receive_skb+0x84/0x17c
process_backlog+0x15c/0x1b8
napi_poll+0x88/0x284
net_rx_action+0xbc/0x23c
__do_softirq+0x20c/0x48c
This could be fixed by forcing instruction order after the new table
information assignment or by switching to RCU for the synchronization.
Fixes: 1e6d83d058ff ("netfilter: x_tables: make xt_replace_table wait until old rules are not used anymore") Reported-by: Sean Tranchetti <stranche@codeaurora.org> Reported-by: kernel test robot <lkp@intel.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
1) Sysbot reported fixes for the new 64/32 bit compat layer.
From Dmitry Safonov.
2) Fix a memory leak in xfrm_user_policy that was introduced
by adding the 64/32 bit compat layer. From Yu Kuai.
* 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec:
net: xfrm: fix memory leak in xfrm_user_policy()
xfrm/compat: Don't allocate memory with __GFP_ZERO
xfrm/compat: memset(0) 64-bit padding at right place
xfrm/compat: Translate by copying XFRMA_UNSPEC attribute
====================
net: stmmac: dwmac-meson8b: fix mask definition of the m250_sel mux
The m250_sel mux clock uses bit 4 in the PRG_ETH0 register. Fix this by
shifting the PRG_ETH0_CLK_M250_SEL_MASK accordingly as the "mask" in
struct clk_mux expects the mask relative to the "shift" field in the
same struct.
While here, get rid of the PRG_ETH0_CLK_M250_SEL_SHIFT macro and use
__ffs() to determine it from the existing PRG_ETH0_CLK_M250_SEL_MASK
macro.
Fixes: 5b97865dfed54e ("net: stmmac: add a glue driver for the Amlogic Meson 8b / GXBB DWMAC") Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com> Reviewed-by: Jerome Brunet <jbrunet@baylibre.com> Link: https://lore.kernel.org/r/20201205213207.519341-1-martin.blumenstingl@googlemail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianguo Wu [Sat, 5 Dec 2020 07:56:33 +0000 (15:56 +0800)]
mptcp: print new line in mptcp_seq_show() if mptcp isn't in use
When do cat /proc/net/netstat, the output isn't append with a new line, it looks like this:
[root@localhost ~]# cat /proc/net/netstat
...
MPTcpExt: 0 0 0 0 0 0 0 0 0 0 0 0 0[root@localhost ~]#
This is because in mptcp_seq_show(), if mptcp isn't in use, net->mib.mptcp_statistics is NULL,
so it just puts all 0 after "MPTcpExt:", and return, forgot the '\n'.
Joseph Huang [Fri, 4 Dec 2020 23:56:28 +0000 (18:56 -0500)]
bridge: Fix a deadlock when enabling multicast snooping
When enabling multicast snooping, bridge module deadlocks on multicast_lock
if 1) IPv6 is enabled, and 2) there is an existing querier on the same L2
network.
The deadlock was caused by the following sequence: While holding the lock,
br_multicast_open calls br_multicast_join_snoopers, which eventually causes
IP stack to (attempt to) send out a Listener Report (in igmp6_join_group).
Since the destination Ethernet address is a multicast address, br_dev_xmit
feeds the packet back to the bridge via br_multicast_rcv, which in turn
calls br_multicast_add_group, which then deadlocks on multicast_lock.
The fix is to move the call br_multicast_join_snoopers outside of the
critical section. This works since br_multicast_join_snoopers only deals
with IP and does not modify any multicast data structures of the bridge,
so there's no need to hold the lock.
Steps to reproduce:
1. sysctl net.ipv6.conf.all.force_mld_version=1
2. have another querier
3. ip link set dev bridge type bridge mcast_snooping 0 && \
ip link set dev bridge type bridge mcast_snooping 1 < deadlock >
Claudiu Manoil [Fri, 4 Dec 2020 17:15:05 +0000 (19:15 +0200)]
enetc: Fix reporting of h/w packet counters
Noticed some inconsistencies in packet statistics reporting.
This patch adds the missing Tx packet counter registers to
ethtool reporting and fixes the information strings for a
few of them.
powerpc/mm: Fix KUAP warning by providing copy_from_kernel_nofault_allowed()
Since commit 2527a7385307 ("powerpc: use non-set_fs based maccess
routines"), userspace access is not granted anymore when using
copy_from_kernel_nofault()
However, kthread_probe_data() uses copy_from_kernel_nofault()
to check validity of pointers. When the pointer is NULL,
it points to userspace, leading to a KUAP fault and triggering
the following big hammer warning many times when you request
a sysrq "show task":
To avoid that, copy_from_kernel_nofault_allowed() is used to check
whether the address is a valid kernel address. But the default
version of it returns true for any address.
Provide a powerpc version of copy_from_kernel_nofault_allowed()
that returns false when the address is below TASK_USER_MAX,
so that copy_from_kernel_nofault() will return -ERANGE.
Gal Pressman [Sun, 6 Dec 2020 15:32:38 +0000 (17:32 +0200)]
RDMA/core: Fix empty gid table for non IB/RoCE devices
The query_gid_table ioctl skips non IB/RoCE ports, which as a result
returns an empty gid table for devices such as EFA which have a GID table,
but are not IB/RoCE.
Dongdong Wang [Sat, 5 Dec 2020 07:59:45 +0000 (23:59 -0800)]
lwt: Disable BH too in run_lwt_bpf()
The per-cpu bpf_redirect_info is shared among all skb_do_redirect()
and BPF redirect helpers. Callers on RX path are all in BH context,
disabling preemption is not sufficient to prevent BH interruption.
In production, we observed strange packet drops because of the race
condition between LWT xmit and TC ingress, and we verified this issue
is fixed after we disable BH.
Although this bug was technically introduced from the beginning, that
is commit 8099c1b96ed6 ("bpf: BPF for lightweight tunnel infrastructure"),
at that time call_rcu() had to be call_rcu_bh() to match the RCU context.
So this patch may not work well before RCU flavor consolidation has been
completed around v5.0.
Update the comments above the code too, as call_rcu() is now BH friendly.
Linus Torvalds [Mon, 7 Dec 2020 19:20:26 +0000 (11:20 -0800)]
Merge tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fix from Steven Rostedt:
"Fix userstacktrace option for instances
While writing an application that requires user stack trace option to
work in instances, I found that the instance option has a bug that
makes it a nop. The check for performing the user stack trace in an
instance, checks the top level options (not the instance options) to
determine if a user stack trace should be performed or not.
This is not only incorrect, but also confusing for users. It confused
me for a bit!"
* tag 'trace-v5.10-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Fix userstacktrace option for instances
MAINTAINERS: add a limited ARM and ARM64 SoC entry
It is expected for ARM and ARM64 SoC related code to go through
sub-architecture maintainers. Their addresses were therefore not
documented to push patch traffic through sub-architecture maintainers.
However when patches touch generic code, e.g. multi_v7_defconfig, the
patch might not be picked up by them and instead should go to the SoC
maintainers - Arnd and Olof.
Add a minimal maintainer's entry for SoC covering only Makefile, so it
will not appear on most of submissions (except new devicetree boards).
It will though serve as a documentation and reference for cases when
submitter does not know where to send his SoC-related patches.
These definitions are evidently left over from the days when
sparsemem settings were platform specific. This was no longer
the case when the platform got merged.
There was no warning in the past, but now the asm/sparsemem.h
header ends up being included indirectly, causing this warning:
In file included from /git/arm-soc/arch/arm/mach-keystone/keystone.c:24:
arch/arm/mach-keystone/memory.h:10:9: warning: 'SECTION_SIZE_BITS' macro redefined [-Wmacro-redefined]
#define SECTION_SIZE_BITS 34
^
arch/arm/include/asm/sparsemem.h:23:9: note: previous definition is here
#define SECTION_SIZE_BITS 28
^
Clearly the definitions never had any effect here, so remove them.
Arnd Bergmann [Mon, 7 Dec 2020 14:28:59 +0000 (15:28 +0100)]
Merge tag 'imx-fixes-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux into arm/fixes
i.MX fixes for 5.10, round 5:
- Fix a regression on SoC revision detection with ANATOP, that is
introduced by commit 5a5eefb5e031 ("ARM: imx: Add missing
of_node_put()").
- Drop PAD_GPIO_6 from imx6qdl-wandboard ENET pin group, as the pin is
used by camera sensor now.
- Fix I2C3_SCL pinmux on imx6qdl-kontron-samx6i board, so that SoM EEPROM
can be accessed.
* tag 'imx-fixes-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/shawnguo/linux:
ARM: dts: imx6qdl-kontron-samx6i: fix I2C_PM scl pin
ARM: dts: imx6qdl-wandboard-revd1: Remove PAD_GPIO_6 from enetgrp
ARM: imx: Use correct SRC base address
iommu/amd: Set DTE[IntTabLen] to represent 512 IRTEs
According to the AMD IOMMU spec, the commit 84c25bc1f617
("iommu/amd: Increase interrupt remapping table limit to 512 entries")
also requires the interrupt table length (IntTabLen) to be set to 9
(power of 2) in the device table mapping entry (DTE).
Fixes: 84c25bc1f617 ("iommu/amd: Increase interrupt remapping table limit to 512 entries") Reported-by: Jerry Snitselaar <jsnitsel@redhat.com> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com> Link: https://lore.kernel.org/r/20201207091920.3052-1-suravee.suthikulpanit@amd.com Signed-off-by: Will Deacon <will@kernel.org>
Xin Long [Mon, 7 Dec 2020 07:55:40 +0000 (15:55 +0800)]
udp: fix the proto value passed to ip_protocol_deliver_rcu for the segments
Guillaume noticed that: for segments udp_queue_rcv_one_skb() returns the
proto, and it should pass "ret" unmodified to ip_protocol_deliver_rcu().
Otherwize, with a negtive value passed, it will underflow inet_protos.
This can be reproduced with IPIP FOU:
# ip fou add port 5555 ipproto 4
# ethtool -K eth1 rx-gro-list on
Fixes: 6b9bfa80f245 ("udp: cope with UDP GRO packet misdirection") Reported-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Huazhong Tan [Mon, 7 Dec 2020 07:20:25 +0000 (15:20 +0800)]
net: hns3: remove a misused pragma packed
hclge_dbg_reg_info[] is defined as an array of packed structure
accidentally. However, this array contains pointers, which are
no longer aligned naturally, and cannot be relocated on PPC64.
Hence, when compile-testing this driver on PPC64 with
CONFIG_RELOCATABLE=y (e.g. PowerPC allyesconfig), there will be
some warnings.
Since each field in structure hclge_qos_pri_map_cmd and
hclge_dbg_bitmap_cmd is type u8, the pragma packed is unnecessary
for these two structures as well, so remove the pragma packed in
hclge_debugfs.h to fix this issue, and this increases
hclge_dbg_reg_info[] by 4 bytes per entry.
Fixes: ee2ccc76874c ("net: hns3: code optimization for debugfs related to "dump reg"") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Huazhong Tan <tanhuazhong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>