Eric Dumazet [Mon, 15 Nov 2021 17:11:50 +0000 (09:11 -0800)]
net: drop nopreempt requirement on sock_prot_inuse_add()
This is distracting really, let's make this simpler,
because many callers had to take care of this
by themselves, even if on x86 this adds more
code than really needed.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 16 Nov 2021 13:10:35 +0000 (13:10 +0000)]
Merge branch 'tcp-optimizations'
Eric Dumazet says:
====================
tcp: optimizations for linux-5.17
Mostly small improvements in this series.
The notable change is in "defer skb freeing after
socket lock is released" in recvmsg() (and RX zerocopy)
The idea is to try to let skb freeing to BH handler,
whenever possible, or at least perform the freeing
outside of the socket lock section, for much improved
performance. This idea can probably be extended
to other protocols.
Tests on a 100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs.
MTU : 1500 (1428 bytes of TCP payload per MSS)
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+ (4096 bytes of TCP payload, plus TCP/IPv6 headers)
Before: 82 Gbit
After: 95 Gbit
====================
Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Nov 2021 19:02:48 +0000 (11:02 -0800)]
tcp: do not call tcp_cleanup_rbuf() if we have a backlog
Under pressure, tcp recvmsg() has logic to process the socket backlog,
but calls tcp_cleanup_rbuf() right before.
Avoiding sending ACK right before processing new segments makes
a lot of sense, as this decrease the number of ACK packets,
with no impact on effective ACK clocking.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Nov 2021 19:02:46 +0000 (11:02 -0800)]
tcp: defer skb freeing after socket lock is released
tcp recvmsg() (or rx zerocopy) spends a fair amount of time
freeing skbs after their payload has been consumed.
A typical ~64KB GRO packet has to release ~45 page
references, eventually going to page allocator
for each of them.
Currently, this freeing is performed while socket lock
is held, meaning that there is a high chance that
BH handler has to queue incoming packets to tcp socket backlog.
This can cause additional latencies, because the user
thread has to process the backlog at release_sock() time,
and while doing so, additional frames can be added
by BH handler.
This patch adds logic to defer these frees after socket
lock is released, or directly from BH handler if possible.
Being able to free these skbs from BH handler helps a lot,
because this avoids the usual alloc/free assymetry,
when BH handler and user thread do not run on same cpu or
NUMA node.
One cpu can now be fully utilized for the kernel->user copy,
and another cpu is handling BH processing and skb/page
allocs/frees (assuming RFS is not forcing use of a single CPU)
Tested:
100Gbit NIC
Max throughput for one TCP_STREAM flow, over 10 runs
MTU : 1500
Before: 55 Gbit
After: 66 Gbit
MTU : 4096+(headers)
Before: 82 Gbit
After: 95 Gbit
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Nov 2021 19:02:45 +0000 (11:02 -0800)]
tcp: avoid indirect calls to sock_rfree
TCP uses sk_eat_skb() when skbs can be removed from receive queue.
However, the call to skb_orphan() from __kfree_skb() incurs
an indirect call so sock_rfee(), which is more expensive than
a direct call, especially for CONFIG_RETPOLINE=y.
Add tcp_eat_recv_skb() function to make the call before
__kfree_skb().
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Nov 2021 19:02:43 +0000 (11:02 -0800)]
tcp: annotate races around tp->urg_data
tcp_poll() and tcp_ioctl() are reading tp->urg_data without socket lock
owned.
Also, it is faster to first check tp->urg_data in tcp_poll(),
then tp->urg_seq == tp->copied_seq, because tp->urg_seq is
located in a different/cold cache line.
Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Mon, 15 Nov 2021 19:02:31 +0000 (11:02 -0800)]
tcp: remove dead code in __tcp_v6_send_check()
For some reason, I forgot to change __tcp_v6_send_check() at
the same time I removed (ip_summed == CHECKSUM_PARTIAL) check
in __tcp_v4_send_check()
Fixes: 72334e1f93f2 ("tcp: remove dead code after CHECKSUM_PARTIAL adoption") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sean Anderson [Fri, 12 Nov 2021 19:04:00 +0000 (14:04 -0500)]
net: macb: Fix several edge cases in validate
There were several cases where validate() would return bogus supported
modes with unusual combinations of interfaces and capabilities. For
example, if state->interface was 10GBASER and the macb had HIGH_SPEED
and PCS but not GIGABIT MODE, then 10/100 modes would be set anyway. In
another case, SGMII could be enabled even if the mac was not a GEM
(despite this being checked for later on in mac_config()). These
inconsistencies make it difficult to refactor this function cleanly.
There is still the open question of what exactly the requirements for
SGMII and 10GBASER are, and what SGMII actually supports. If someone
from Cadence (or anyone else with access to the GEM/MACB datasheet)
could comment on this, it would be greatly appreciated. In particular,
what is supported by Cadence vs. vendor extension/limitation?
To address this, the current logic is split into three parts. First, we
determine what we support, then we eliminate unsupported interfaces, and
finally we set the appropriate link modes. There is still some cruft
related to NA, but this can be removed in a future patch.
Signed-off-by: Sean Anderson <sean.anderson@seco.com> Reviewed-by: Parshuram Thombare <pthombar@cadence.com> Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Link: https://lore.kernel.org/r/20211112190400.1937855-1-sean.anderson@seco.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
We've added 72 non-merge commits during the last 13 day(s) which contain
a total of 171 files changed, 2728 insertions(+), 1143 deletions(-).
The main changes are:
1) Add btf_type_tag attributes to bring kernel annotations like __user/__rcu to
BTF such that BPF verifier will be able to detect misuse, from Yonghong Song.
2) Big batch of libbpf improvements including various fixes, future proofing APIs,
and adding a unified, OPTS-based bpf_prog_load() low-level API, from Andrii Nakryiko.
3) Add ingress_ifindex to BPF_SK_LOOKUP program type for selectively applying the
programmable socket lookup logic to packets from a given netdev, from Mark Pashmfouroush.
4) Remove the 128M upper JIT limit for BPF programs on arm64 and add selftest to
ensure exception handling still works, from Russell King and Alan Maguire.
5) Add a new bpf_find_vma() helper for tracing to map an address to the backing
file such as shared library, from Song Liu.
6) Batch of various misc fixes to bpftool, fixing a memory leak in BPF program dump,
updating documentation and bash-completion among others, from Quentin Monnet.
7) Deprecate libbpf bpf_program__get_prog_info_linear() API and migrate its users as
the API is heavily tailored around perf and is non-generic, from Dave Marchevsky.
8) Enable libbpf's strict mode by default in bpftool and add a --legacy option as an
opt-out for more relaxed BPF program requirements, from Stanislav Fomichev.
9) Fix bpftool to use libbpf_get_error() to check for errors, from Hengqi Chen.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (72 commits)
bpftool: Use libbpf_get_error() to check error
bpftool: Fix mixed indentation in documentation
bpftool: Update the lists of names for maps and prog-attach types
bpftool: Fix indent in option lists in the documentation
bpftool: Remove inclusion of utilities.mak from Makefiles
bpftool: Fix memory leak in prog_dump()
selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
selftests/bpf: Fix an unused-but-set-variable compiler warning
bpf: Introduce btf_tracing_ids
bpf: Extend BTF_ID_LIST_GLOBAL with parameter for number of IDs
bpftool: Enable libbpf's strict mode by default
docs/bpf: Update documentation for BTF_KIND_TYPE_TAG support
selftests/bpf: Clarify llvm dependency with btf_tag selftest
selftests/bpf: Add a C test for btf_type_tag
selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
selftests/bpf: Test BTF_KIND_DECL_TAG for deduplication
selftests/bpf: Add BTF_KIND_TYPE_TAG unit tests
selftests/bpf: Test libbpf API function btf__add_type_tag()
bpftool: Support BTF_KIND_TYPE_TAG
libbpf: Support BTF_KIND_TYPE_TAG
...
====================
Please revert. Besides the driver in net, it modifies the I2C core
code. This has not been acked by the I2C maintainer (in this case me).
So, please don't pull this in via the net tree. The question raised here
(extending SMBus calls to 255 byte) is complicated because we need ABI
backwards compatibility.
The various validate method implementations we have in phylink users
have been quite repetitive but also prone to bugs. These patches
introduce a generic implementation which relies solely on the
supported_interfaces bitmap introduced during last cycle, and in the
first patch, a bit array of MAC capabilities.
MAC drivers are free to continue to do their own thing if they have
special requirements - such as mvneta and mvpp2 which do not support
1000base-X without AN enabled. Most implementations currently in the
kernel can be converted to call phylink_generic_validate() directly
from the phylink MAC operations structure once they fill in the
supported_interfaces and mac_capabilities members of phylink_config.
This series introduces the generic implementation, and converts mvneta
and mvpp2 to use it.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Convert mvpp2 to use phylink_generic_validate() for the bulk of its
validate() implementation. This network adapter has a restriction
that for 802.3z links, autonegotiation must be enabled.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Convert mvneta to use phylink_generic_validate() for the bulk of its
validate() implementation. This network adapter has a restriction
that for 802.3z links, autonegotiation must be enabled.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Add a generic validate() implementation using the supported_interfaces
and a bitmask of MAC pause/speed/duplex capabilities. This allows us
to entirely eliminate many driver private validate() implementations.
We expose the underlying phylink_get_linkmodes() function so that
drivers which have special needs can still benefit from conversion.
Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Christophe Leroy [Mon, 15 Nov 2021 11:08:59 +0000 (12:08 +0100)]
net/wan/fsl_ucc_hdlc: fix sparse warnings
CHECK drivers/net/wan/fsl_ucc_hdlc.c
drivers/net/wan/fsl_ucc_hdlc.c:309:57: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:309:57: expected void [noderef] __iomem *
drivers/net/wan/fsl_ucc_hdlc.c:309:57: got restricted __be16 *
drivers/net/wan/fsl_ucc_hdlc.c:311:46: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:311:46: expected void [noderef] __iomem *
drivers/net/wan/fsl_ucc_hdlc.c:311:46: got restricted __be32 *
drivers/net/wan/fsl_ucc_hdlc.c:320:57: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:320:57: expected void [noderef] __iomem *
drivers/net/wan/fsl_ucc_hdlc.c:320:57: got restricted __be16 *
drivers/net/wan/fsl_ucc_hdlc.c:322:46: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:322:46: expected void [noderef] __iomem *
drivers/net/wan/fsl_ucc_hdlc.c:322:46: got restricted __be32 *
drivers/net/wan/fsl_ucc_hdlc.c:372:29: warning: incorrect type in assignment (different base types)
drivers/net/wan/fsl_ucc_hdlc.c:372:29: expected unsigned short [usertype]
drivers/net/wan/fsl_ucc_hdlc.c:372:29: got restricted __be16 [usertype]
drivers/net/wan/fsl_ucc_hdlc.c:379:36: warning: restricted __be16 degrades to integer
drivers/net/wan/fsl_ucc_hdlc.c:402:12: warning: incorrect type in assignment (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:402:12: expected struct qe_bd [noderef] __iomem *bd
drivers/net/wan/fsl_ucc_hdlc.c:402:12: got struct qe_bd *curtx_bd
drivers/net/wan/fsl_ucc_hdlc.c:425:20: warning: incorrect type in assignment (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:425:20: expected struct qe_bd [noderef] __iomem *[assigned] bd
drivers/net/wan/fsl_ucc_hdlc.c:425:20: got struct qe_bd *tx_bd_base
drivers/net/wan/fsl_ucc_hdlc.c:427:16: error: incompatible types in comparison expression (different address spaces):
drivers/net/wan/fsl_ucc_hdlc.c:427:16: struct qe_bd [noderef] __iomem *
drivers/net/wan/fsl_ucc_hdlc.c:427:16: struct qe_bd *
drivers/net/wan/fsl_ucc_hdlc.c:462:33: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:506:41: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:528:33: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:552:38: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:596:67: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:611:41: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:851:38: warning: incorrect type in initializer (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:854:40: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:855:40: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:858:39: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:861:37: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:866:38: warning: incorrect type in initializer (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:868:21: warning: incorrect type in argument 1 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:870:40: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:871:40: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:873:39: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:993:57: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:995:46: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:1004:57: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:1006:46: warning: incorrect type in argument 2 (different address spaces)
drivers/net/wan/fsl_ucc_hdlc.c:412:35: warning: dereference of noderef expression
drivers/net/wan/fsl_ucc_hdlc.c:412:35: warning: dereference of noderef expression
drivers/net/wan/fsl_ucc_hdlc.c:724:29: warning: dereference of noderef expression
drivers/net/wan/fsl_ucc_hdlc.c:815:21: warning: dereference of noderef expression
drivers/net/wan/fsl_ucc_hdlc.c:1021:29: warning: dereference of noderef expression
Most of the warnings are due to DMA memory being incorrectly handled as IO memory.
Fix it by doing direct read/write and doing proper dma_rmb() / dma_wmb().
Other problems are type mismatches or lack of use of IO accessors.
Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk> Reported-by: kernel test robot <lkp@intel.com> Link: https://lkml.org/lkml/2021/11/12/647 Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: David S. Miller <davem@davemloft.net>
Guo Zhengkui [Mon, 15 Nov 2021 05:00:10 +0000 (13:00 +0800)]
hinic: use ARRAY_SIZE instead of ARRAY_LEN
ARRAY_SIZE defined in <linux/kernel.h> is safer than self-defined
macros to get size of an array such as ARRAY_LEN used here. Because
ARRAY_SIZE uses __must_be_array(arr) to ensure arr is really an array.
Reported-by: Alejandro Colomar <colomar.6.4.3@gmail.com> Signed-off-by: Guo Zhengkui <guozhengkui@vivo.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jacky Chou [Mon, 15 Nov 2021 03:49:41 +0000 (11:49 +0800)]
net: usb: ax88179_178a: add TSO feature
On low-effciency embedded platforms, transmission performance is poor
due to on Bulk-out with single packet.
Adding TSO feature improves the transmission performance and reduces
the number of interrupt caused by Bulk-out complete.
Reference to module, net: usb: aqc111.
Signed-off-by: Jacky Chou <jackychou@asix.com.tw> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 15 Nov 2021 14:11:25 +0000 (14:11 +0000)]
Merge branch 'mctp-i2c-driver'
Matt Johnston says:
====================
MCTP I2C driver
This patch series adds a netdev driver providing MCTP transport over
I2C.
It applies against net-next using recent MCTP changes there, though also
has I2C core changes for review. I'll leave it to maintainers where it
should be applied - please let me know if it needs to be submitted
differently.
The I2C patches were previously sent as RFC though the only feedback
there was an ack to 255 bytes for aspeed.
The dt-bindings patch went through review on the list.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Johnston [Mon, 15 Nov 2021 02:49:26 +0000 (10:49 +0800)]
mctp i2c: MCTP I2C binding driver
Provides MCTP network transport over an I2C bus, as specified in
DMTF DSP0237. All messages between nodes are sent as SMBus Block Writes.
Each I2C bus to be used for MCTP is flagged in devicetree by a
'mctp-controller' property on the bus node. Each flagged bus gets a
mctpi2cX net device created based on the bus number. A
'mctp-i2c-controller' I2C client needs to be added under the adapter. In
an I2C mux situation the mctp-i2c-controller node must be attached only
to the root I2C bus. The I2C client will handle incoming I2C slave block
write data for subordinate busses as well as its own bus.
In configurations without devicetree a driver instance can be attached
to a bus using the I2C slave new_device mechanism.
The MCTP core will hold/release the MCTP I2C device while responses
are pending (a 6 second timeout or once a socket is closed, response
received etc). While held the MCTP I2C driver will lock the I2C bus so
that the correct I2C mux remains selected while responses are received.
(Ideally we would just lock the mux to keep the current bus selected for
the response rather than a full I2C bus lock, but that isn't exposed in
the I2C mux API)
This driver requires I2C adapters that allow 255 byte transfers
(SMBus 3.0) as the specification requires a minimum MTU of 68 bytes.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: Jeremy Kerr <jk@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Johnston [Mon, 15 Nov 2021 02:49:25 +0000 (10:49 +0800)]
dt-bindings: net: New binding mctp-i2c-controller
Used to define a local endpoint to communicate with MCTP peripherals
attached to an I2C bus. This I2C endpoint can communicate with remote
MCTP devices on the I2C bus.
In the example I2C topology below (matching the second yaml example) we
have MCTP devices on busses i2c1 and i2c6. MCTP-supporting busses are
indicated by the 'mctp-controller' DT property on an I2C bus node.
A mctp-i2c-controller I2C client DT node is placed at the top of the
mux topology, since only the root I2C adapter will support I2C slave
functionality.
.-------.
|eeprom |
.------------. .------. /'-------'
| adapter | | mux --@0,i2c5------'
| i2c1 ----.*| --@1,i2c6--.--.
|............| \'------' \ \ .........
| mctp-i2c- | \ \ \ .mctpB .
| controller | \ \ '.0x30 .
| | \ ......... \ '.......'
| 0x50 | \ .mctpA . \ .........
'------------' '.0x1d . '.mctpC .
'.......' '.0x31 .
'.......'
(mctpX boxes above are remote MCTP devices not included in the DT at
present, they can be hotplugged/probed at runtime. A DT binding for
specific fixed MCTP devices could be added later if required)
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Reviewed-by: Rob Herring <robh@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
255 byte support has been tested on a npcm750 board
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Reviewed-by: Tali Perry <tali.perry1@gmail.com> Reviewed-by: Patrick Venture <venture@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Johnston [Mon, 15 Nov 2021 02:49:23 +0000 (10:49 +0800)]
i2c: aspeed: Allow 255 byte block transfers
255 byte transfers have been tested on an AST2500 board
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Reviewed-by: Brendan Higgins <brendanhiggins@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matt Johnston [Mon, 15 Nov 2021 02:49:21 +0000 (10:49 +0800)]
i2c: core: Allow 255 byte transfers for SMBus 3.x
SMBus 3.0 increased the maximum block transfer size from 32 bytes to
255 bytes. We increase the size of struct i2c_smbus_data's block[]
member.
i2c_smbus_xfer() and i2c_smbus_xfer_emulated() now support 255 byte
block operations, other block functions remain limited to 32 bytes for
compatibility with existing callers.
We allow adapters to indicate support for the larger size with
I2C_FUNC_SMBUS_V3_BLOCK. Most emulated drivers should be able to use 255
byte blocks by replacing I2C_SMBUS_BLOCK_MAX with I2C_SMBUS_V3_BLOCK_MAX
though some will have hardware limitations that need testing.
Signed-off-by: Matt Johnston <matt@codeconstruct.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
The 'inuse' bitmap is local to this function. So we can use the
non-atomic '__set_bit()' to save a few cycles.
While at it, also remove some useless {}.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr> Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: sched: sch_netem: Refactor code in 4-state loss generator
Fixed comments to match description with variable names and
refactored code to match the convention as per [1].
To match the convention mapping is done as follows:
State 3 - LOST_IN_BURST_PERIOD
State 4 - LOST_IN_GAP_PERIOD
[1] S. Salsano, F. Ludovici, A. Ordine, "Definition of a general
and intuitive loss model for packet networks and its implementation
in the Netem module in the Linux kernel"
Fixes: 2aa3f4f94298 ("sch_netem: replace magic numbers with enumerate") Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Acked-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
net: dsa: vsc73xxx: Make vsc73xx_remove() return void
vsc73xx_remove() returns zero unconditionally and no caller checks the
returned value. So convert the function to return no value.
Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de> Reviewed-by: Vladimir Oltean <olteanv@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The previous stmmac_xdp_set_prog() implementation uses stmmac_release()
and stmmac_open() which tear down the PHY device and causes undesirable
autonegotiation which causes a delay whenever AFXDP ZC is setup.
This patch introduces two new functions that just sufficiently tear
down DMA descriptors, buffer, NAPI process, and IRQs and reestablish
them accordingly in both stmmac_xdp_release() and stammac_xdp_open().
As the results of this enhancement, we get rid of transient state
introduced by the link auto-negotiation:
Reported-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: Ong Boon Leong <boon.leong.ong@intel.com> Tested-by: Kurt Kanzenbach <kurt@linutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Hengqi Chen [Mon, 15 Nov 2021 01:24:36 +0000 (09:24 +0800)]
bpftool: Use libbpf_get_error() to check error
Currently, LIBBPF_STRICT_ALL mode is enabled by default for
bpftool which means on error cases, some libbpf APIs would
return NULL pointers. This makes IS_ERR check failed to detect
such cases and result in segfault error. Use libbpf_get_error()
instead like we do in libbpf itself.
Andrii Nakryiko [Mon, 15 Nov 2021 02:34:10 +0000 (18:34 -0800)]
Merge branch 'bpftool: miscellaneous fixes'
Quentin Monnet says:
====================
This set contains several independent minor fixes for bpftool, its
Makefile, and its documentation. Please refer to individual commits for
details.
====================
Quentin Monnet [Wed, 10 Nov 2021 11:46:31 +0000 (11:46 +0000)]
bpftool: Update the lists of names for maps and prog-attach types
To support the different BPF map or attach types, bpftool must remain
up-to-date with the types supported by the kernel. Let's update the
lists, by adding the missing Bloom filter map type and the perf_event
attach type.
Both missing items were found with test_bpftool_synctypes.py.
Quentin Monnet [Wed, 10 Nov 2021 11:46:30 +0000 (11:46 +0000)]
bpftool: Fix indent in option lists in the documentation
Mixed indentation levels in the lists of options in bpftool's
documentation produces some unexpected results. For the "bpftool" man
page, it prints a warning:
$ make -C bpftool.8
GEN bpftool.8
<stdin>:26: (ERROR/3) Unexpected indentation.
For other pages, there is no warning, but it results in a line break
appearing in the option lists in the generated man pages.
RST paragraphs should have a uniform indentation level. Let's fix it.
Fixes: 310c7809be34 ("tools: bpftool: Update and synchronise option list in doc and help msg") Fixes: 89867d944072 ("tools: bpftool: Document and add bash completion for -L, -B options") Signed-off-by: Quentin Monnet <quentin@isovalent.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20211110114632.24537-5-quentin@isovalent.com
Quentin Monnet [Wed, 10 Nov 2021 11:46:28 +0000 (11:46 +0000)]
bpftool: Remove inclusion of utilities.mak from Makefiles
Bpftool's Makefile, and the Makefile for its documentation, both include
scripts/utilities.mak, but they use none of the items defined in this
file. Remove the includes.
Quentin Monnet [Wed, 10 Nov 2021 11:46:27 +0000 (11:46 +0000)]
bpftool: Fix memory leak in prog_dump()
Following the extraction of prog_dump() from do_dump(), the struct btf
allocated in prog_dump() is no longer freed on error; the struct
bpf_prog_linfo is not freed at all. Make sure we release them before
exiting the function.
Yonghong Song [Fri, 12 Nov 2021 20:48:38 +0000 (12:48 -0800)]
selftests/bpf: Fix a tautological-constant-out-of-range-compare compiler warning
When using clang to build selftests with LLVM=1 in make commandline,
I hit the following compiler warning:
benchs/bench_bloom_filter_map.c:84:46: warning: result of comparison of constant 256
with expression of type '__u8' (aka 'unsigned char') is always false
[-Wtautological-constant-out-of-range-compare]
if (args.value_size < 2 || args.value_size > 256) {
~~~~~~~~~~~~~~~ ^ ~~~
The reason is arg.vaue_size has type __u8, so comparison "args.value_size > 256"
is always false.
This patch fixed the issue by doing proper comparison before assigning the
value to args.value_size. The patch also fixed the same issue in two
other places.
Yonghong Song [Fri, 12 Nov 2021 20:48:33 +0000 (12:48 -0800)]
selftests/bpf: Fix an unused-but-set-variable compiler warning
When using clang to build selftests with LLVM=1 in make commandline,
I hit the following compiler warning:
xdpxceiver.c:747:6: warning: variable 'total' set but not used [-Wunused-but-set-variable]
u32 total = 0;
^
This patch fixed the issue by removing that declaration and its
assocatied unused operation.
Linus Torvalds [Fri, 12 Nov 2021 20:25:50 +0000 (12:25 -0800)]
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull more SCSI updates from James Bottomley:
"This series is all the stragglers that didn't quite make the first
merge window pull. It's mostly minor updates and bug fixes of merge
window code but it also has two driver updates: ufs and qla2xxx"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (46 commits)
scsi: scsi_debug: Don't call kcalloc() if size arg is zero
scsi: core: Remove command size deduction from scsi_setup_scsi_cmnd()
scsi: scsi_ioctl: Validate command size
scsi: ufs: ufshpb: Properly handle max-single-cmd
scsi: core: Avoid leaving shost->last_reset with stale value if EH does not run
scsi: bsg: Fix errno when scsi_bsg_register_queue() fails
scsi: sr: Remove duplicate assignment
scsi: ufs: ufs-exynos: Introduce ExynosAuto v9 virtual host
scsi: ufs: ufs-exynos: Multi-host configuration for ExynosAuto v9
scsi: ufs: ufs-exynos: Support ExynosAuto v9 UFS
scsi: ufs: ufs-exynos: Add pre/post_hce_enable drv callbacks
scsi: ufs: ufs-exynos: Factor out priv data init
scsi: ufs: ufs-exynos: Add EXYNOS_UFS_OPT_SKIP_CONFIG_PHY_ATTR option
scsi: ufs: ufs-exynos: Support custom version of ufs_hba_variant_ops
scsi: ufs: ufs-exynos: Add setup_clocks callback
scsi: ufs: ufs-exynos: Add refclkout_stop control
scsi: ufs: ufs-exynos: Simplify drv_data retrieval
scsi: ufs: ufs-exynos: Change pclk available max value
scsi: ufs: Add quirk to enable host controller without PH configuration
scsi: ufs: Add quirk to handle broken UIC command
...
Linus Torvalds [Fri, 12 Nov 2021 20:22:06 +0000 (12:22 -0800)]
Merge tag 'pwm/for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm
Pull pwm updates from Thierry Reding:
"This set is mostly small fixes and cleanups, so more of a janitorial
update for this cycle"
* tag 'pwm/for-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thierry.reding/linux-pwm:
pwm: vt8500: Rename pwm_busy_wait() to make it obviously driver-specific
dt-bindings: pwm: tpu: Add R-Car M3-W+ device tree bindings
dt-bindings: pwm: tpu: Add R-Car V3U device tree bindings
pwm: pwm-samsung: Trigger manual update when disabling PWM
pwm: visconti: Simplify using devm_pwmchip_add()
pwm: samsung: Describe driver in Kconfig
pwm: Make it explicit that pwm_apply_state() might sleep
pwm: Add might_sleep() annotations for !CONFIG_PWM API functions
pwm: atmel: Drop unused header
Linus Torvalds [Fri, 12 Nov 2021 20:17:30 +0000 (12:17 -0800)]
Merge tag 'sound-fix-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Pull sound fixes from Takashi Iwai:
"A collection of fixes for 5.16-rc1, notably for a few regressions that
were found in 5.15 and pre-rc1:
- revert of the unification of SG-buffer helper functions on x86 and
the relevant fix
- regression fixes for mmap after the recent code refactoring
- two NULL dereference fixes in HD-audio controller driver
- UAF fixes in ALSA timer core
- a few usual HD-audio and FireWire quirks"
* tag 'sound-fix-5.16-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
ALSA: fireworks: add support for Loud Onyx 1200f quirk
ALSA: hda: fix general protection fault in azx_runtime_idle
ALSA: hda: Free card instance properly at probe errors
ALSA: hda/realtek: Add quirk for HP EliteBook 840 G7 mute LED
ALSA: memalloc: Remove a stale comment
ALSA: synth: missing check for possible NULL after the call to kstrdup
ALSA: memalloc: Use proper SG helpers for noncontig allocations
ALSA: pci: rme: Fix unaligned buffer addresses
ALSA: firewire-motu: add support for MOTU Track 16
ALSA: PCM: Fix NULL dereference at mmap checks
ALSA: hda/realtek: Add quirk for ASUS UX550VE
ALSA: timer: Unconditionally unlink slave instances, too
ALSA: memalloc: Catch call with NULL snd_dma_buffer pointer
Revert "ALSA: memalloc: Convert x86 SG-buffer handling with non-contiguous type"
ALSA: hda/realtek: Add a quirk for Acer Spin SP513-54N
ALSA: firewire-motu: add support for MOTU Traveler mk3
ALSA: hda/realtek: Headset fixup for Clevo NH77HJQ
ALSA: timer: Fix use-after-free problem
Linus Torvalds [Fri, 12 Nov 2021 20:11:07 +0000 (12:11 -0800)]
Merge tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm
Pull more drm updates from Dave Airlie:
"I missed a drm-misc-next pull for the main pull last week. It wasn't
that major and isn't the bulk of this at all. This has a bunch of
fixes all over, a lot for amdgpu and i915.
bridge:
- HPD improvments for lt9611uxc
- eDP aux-bus support for ps8640
- LVDS data-mapping selection support
ttm:
- remove huge page functionality (needs reworking)
- fix a race condition during BO eviction
nouveau:
- various code style changes
- refcount fix
- device removal fixes
- protect client list with a mutex
- fix CE0 address calculation
i915:
- DP rates related fixes
- Revert disabling dual eDP that was causing state readout problems
- put the cdclk vtables in const data
- Fix DVO port type for older platforms
- Fix blankscreen by turning DP++ TMDS output buffers on encoder->shutdown
- CCS FBs related fixes
- Fix recursive lock in GuC submission
- Revert guc_id from i915_request tracepoint
- Build fix around dmabuf
amdgpu:
- GPU reset fix
- Aldebaran fix
- Yellow Carp fixes
- DCN2.1 DMCUB fix
- IOMMU regression fix for Picasso
- DSC display fixes
- BPC display calculation fixes
- Other misc display fixes
- Don't allow partial copy from user for DC debugfs
- SRIOV fixes
- GFX9 CSB pin count fix
- Various IP version check fixes
- DP 2.0 fixes
- Limit DCN1 MPO fix to DCN1
amdkfd:
- SVM fixes
- Fix gfx version for renoir
- Reset fixes
udl:
- timeout fix
imx:
- circular locking fix
virtio:
- NULL ptr deref fix"
* tag 'drm-next-2021-11-12' of git://anongit.freedesktop.org/drm/drm: (126 commits)
drm/ttm: Double check mem_type of BO while eviction
drm/amdgpu: add missed support for UVD IP_VERSION(3, 0, 64)
drm/amdgpu: drop jpeg IP initialization in SRIOV case
drm/amd/display: reject both non-zero src_x and src_y only for DCN1x
drm/amd/display: Add callbacks for DMUB HPD IRQ notifications
drm/amd/display: Don't lock connection_mutex for DMUB HPD
drm/amd/display: Add comment where CONFIG_DRM_AMD_DC_DCN macro ends
drm/amdkfd: Fix retry fault drain race conditions
drm/amdkfd: lower the VAs base offset to 8KB
drm/amd/display: fix exit from amdgpu_dm_atomic_check() abruptly
drm/amd/amdgpu: fix the kfd pre_reset sequence in sriov
drm/amdgpu: fix uvd crash on Polaris12 during driver unloading
drm/i915/adlp/fb: Prevent the mapping of redundant trailing padding NULL pages
drm/i915/fb: Fix rounding error in subsampled plane size calculation
drm/i915/hdmi: Turn DP++ TMDS output buffers back on in encoder->shutdown()
drm/locking: fix __stack_depot_* name conflict
drm/virtio: Fix NULL dereference error in virtio_gpu_poll
drm/amdgpu: fix SI handling in amdgpu_device_asic_has_dc_support()
drm/amdgpu: Fix dangling kfd_bo pointer for shared BOs
drm/amd/amdkfd: Don't sent command to HWS on kfd reset
...
Linus Torvalds [Fri, 12 Nov 2021 19:44:31 +0000 (11:44 -0800)]
Merge tag 'rtc-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"This includes new ioctls to get and set parameters and in particular
the backup switch mode that is needed for some RTCs to actually enable
the backup voltage (and have a useful RTC).
The same interface can also be used to get the actual features
supported by the RTC so userspace has a better way than trying and
failing.
Summary:
Subsystem:
- Add new ioctl to get and set extra RTC parameters, this includes
backup switch mode
- Expose available features to userspace, in particular, when alarmas
have a resolution of one minute instead of a second.
- Let the core handle those alarms with a minute resolution
New driver:
- MSTAR MSC313 RTC
Drivers:
- Add SPI ID table where necessary
- Add BSM support for rv3028, rv3032 and pcf8523
- s3c: set RTC range
- rx8025: set range, implement .set_offset and .read_offset"
* tag 'rtc-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (50 commits)
rtc: rx8025: use .set_offset/.read_offset
rtc: rx8025: use rtc_add_group
rtc: rx8025: clear RTC_FEATURE_ALARM when alarm are not supported
rtc: rx8025: set range
rtc: rx8025: let the core handle the alarm resolution
rtc: rx8025: switch to devm_rtc_allocate_device
rtc: ab8500: let the core handle the alarm resolution
rtc: ab-eoz9: support UIE when available
rtc: ab-eoz9: use RTC_FEATURE_UPDATE_INTERRUPT
rtc: rv3032: let the core handle the alarm resolution
rtc: s35390a: let the core handle the alarm resolution
rtc: handle alarms with a minute resolution
rtc: pcf85063: silence cppcheck warning
rtc: rv8803: fix writing back ctrl in flag register
rtc: s3c: Add time range
rtc: s3c: Extract read/write IO into separate functions
rtc: s3c: Remove usage of devm_rtc_device_register()
rtc: tps80031: Remove driver
rtc: sun6i: Allow probing without an early clock provider
rtc: pcf8523: add BSM support
...
Sasha Levin [Fri, 12 Nov 2021 15:16:02 +0000 (10:16 -0500)]
tools/lib/lockdep: drop liblockdep
TL;DR: While a tool like liblockdep is useful, it probably doesn't
belong within the kernel tree.
liblockdep attempts to reuse kernel code both directly (by directly
building the kernel's lockdep code) as well as indirectly (by using
sanitized headers). This makes liblockdep an integral part of the
kernel.
It also makes liblockdep quite unique: while other userspace code might
use sanitized headers, it generally doesn't attempt to use kernel code
directly which means that changes on the kernel side of things don't
affect (and break) it directly.
All our workflows and tooling around liblockdep don't support this
uniqueness. Changes that go into the kernel code aren't validated to not
break in-tree userspace code.
liblockdep ended up being very fragile, breaking over and over, to the
point that living in the same tree as the lockdep code lost most of it's
value.
liblockdep should continue living in an external tree, syncing with
the kernel often, in a controllable way.
Linus Torvalds [Fri, 12 Nov 2021 18:56:25 +0000 (10:56 -0800)]
thermal: int340x: fix build on 32-bit targets
Commit 9ebf1a5db1fc ("thermal/drivers/int340x: processor_thermal: Suppot
64 bit RFIM responses") started using 'readq()' to read 64-bit status
responses from the int340x hardware.
That's all fine and good, but on 32-bit targets a 64-bit 'readq()' is
ambiguous, since it's no longer an atomic access. Some hardware might
require 64-bit accesses, and other hardware might want low word first or
high word first.
It's quite likely that the driver isn't relevant in a 32-bit environment
any more, and there's a patch floating around to just make it depend on
X86_64, but let's make it buildable on x86-32 anyway.
The driver previously just read the low 32 bits, so the hardware
certainly is ok with 32-bit reads, and in a little-endian environment
the low word first model is the natural one.
So just add the include for the 'io-64-nonatomic-lo-hi.h' version.
Fixes: 9ebf1a5db1fc ("thermal/drivers/int340x: processor_thermal: Suppot 64 bit RFIM responses") Reported-by: Jakub Kicinski <kuba@kernel.org> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Song Liu [Fri, 12 Nov 2021 15:02:43 +0000 (07:02 -0800)]
bpf: Introduce btf_tracing_ids
Similar to btf_sock_ids, btf_tracing_ids provides btf ID for task_struct,
file, and vm_area_struct via easy to understand format like
btf_tracing_ids[BTF_TRACING_TYPE_[TASK|file|VMA]].
This is caused by hard-coded name[1] in BTF_ID_LIST_GLOBAL (w/o
CONFIG_DEBUG_INFO_BTF). Fix this by adding a parameter n to
BTF_ID_LIST_GLOBAL. This avoids ifdef CONFIG_DEBUG_INFO_BTF in btf.c and
filter.c.
Fixes: 223fe5821395 ("bpf: Introduce helper bpf_find_vma") Reported-by: syzbot+e0d81ec552a21d9071aa@syzkaller.appspotmail.com Reported-by: Eric Dumazet <edumazet@google.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Yonghong Song <yhs@fb.com> Link: https://lore.kernel.org/bpf/20211112150243.1270987-2-songliubraving@fb.com
Otherwise, attaching with bpftool doesn't work with strict section names.
Also:
- Add --legacy option to switch back to pre-1.0 behavior
- Print a warning when program fails to load in strict mode to
point to --legacy flag
- By default, don't append / to the section name; in strict
mode it's relevant only for a small subset of prog types
+ bpftool --legacy prog loadall tools/testing/selftests/bpf/test_cgroup_link.o /sys/fs/bpf/kprobe type kprobe
libbpf: failed to pin program: File exists
Error: failed to pin all programs
+ bpftool prog loadall tools/testing/selftests/bpf/test_cgroup_link.o /sys/fs/bpf/kprobe type kprobe
v1 -> v2:
- strict by default (Quentin Monnet)
- add more info to --legacy description (Quentin Monnet)
- add bash completion (Quentin Monnet)
Merge branch 'Support BTF_KIND_TYPE_TAG for btf_type_tag attributes'
Yonghong Song says:
====================
LLVM patches ([1] for clang, [2] and [3] for BPF backend)
added support for btf_type_tag attributes. This patch
added support for the kernel.
The main motivation for btf_type_tag is to bring kernel
annotations __user, __rcu etc. to btf. With such information
available in btf, bpf verifier can detect mis-usages
and reject the program. For example, for __user tagged pointer,
developers can then use proper helper like bpf_probe_read_kernel()
etc. to read the data.
BTF_KIND_TYPE_TAG may also useful for other tracing
facility where instead of to require user to specify
kernel/user address type, the kernel can detect it
by itself with btf.
Patch 1 added support in kernel, Patch 2 for libbpf and Patch 3
for bpftool. Patches 4-9 are for bpf selftests and Patch 10
updated docs/bpf/btf.rst file with new btf kind.
Changelogs:
v2 -> v3:
- rebase to resolve merge conflicts.
v1 -> v2:
- add more dedup tests.
- remove build requirement for LLVM=1.
- remove testing macro __has_attribute in bpf programs
as it is always defined in recent clang compilers.
====================
Yonghong Song [Fri, 12 Nov 2021 01:26:46 +0000 (17:26 -0800)]
selftests/bpf: Add a C test for btf_type_tag
The following is the main btf_type_tag usage in the
C test:
#define __tag1 __attribute__((btf_type_tag("tag1")))
#define __tag2 __attribute__((btf_type_tag("tag2")))
struct btf_type_tag_test {
int __tag1 * __tag1 __tag2 *p;
} g;
The bpftool raw dump with related types:
[4] INT 'int' size=4 bits_offset=0 nr_bits=32 encoding=SIGNED
[11] STRUCT 'btf_type_tag_test' size=8 vlen=1
'p' type_id=14 bits_offset=0
[12] TYPE_TAG 'tag1' type_id=16
[13] TYPE_TAG 'tag2' type_id=12
[14] PTR '(anon)' type_id=13
[15] TYPE_TAG 'tag1' type_id=4
[16] PTR '(anon)' type_id=15
[17] VAR 'g' type_id=11, linkage=global
With format C dump, we have
struct btf_type_tag_test {
int __attribute__((btf_type_tag("tag1"))) * __attribute__((btf_type_tag("tag1"))) __attribute__((btf_type_tag("tag2"))) *p;
};
The result C code is identical to the original definition except macro's are gone.
Yonghong Song [Fri, 12 Nov 2021 01:26:41 +0000 (17:26 -0800)]
selftests/bpf: Rename progs/tag.c to progs/btf_decl_tag.c
Rename progs/tag.c to progs/btf_decl_tag.c so we can introduce
progs/btf_type_tag.c in the next patch.
Also create a subtest for btf_decl_tag in prog_tests/btf_tag.c
so we can introduce btf_type_tag subtest in the next patch.
I also took opportunity to remove the check whether __has_attribute
is defined or not in progs/btf_decl_tag.c since all recent
clangs should already support this macro.
Yonghong Song [Fri, 12 Nov 2021 01:26:09 +0000 (17:26 -0800)]
bpf: Support BTF_KIND_TYPE_TAG for btf_type_tag attributes
LLVM patches ([1] for clang, [2] and [3] for BPF backend)
added support for btf_type_tag attributes. This patch
added support for the kernel.
The main motivation for btf_type_tag is to bring kernel
annotations __user, __rcu etc. to btf. With such information
available in btf, bpf verifier can detect mis-usages
and reject the program. For example, for __user tagged pointer,
developers can then use proper helper like bpf_probe_read_user()
etc. to read the data.
BTF_KIND_TYPE_TAG may also useful for other tracing
facility where instead of to require user to specify
kernel/user address type, the kernel can detect it
by itself with btf.
Merge branch 'Future-proof more tricky libbpf APIs'
Andrii Nakryiko says:
====================
This patch set continues the work of revamping libbpf APIs that are not
extensible, as they were added before we figured out all the intricacies of
building APIs that can preserve ABI compatibility (both backward and forward).
What makes them tricky is that (most of) these APIs are actively used by
multiple applications, so we need to be careful about refactoring them. See
individual patches for details, but the general approach is similar to
previous bpf_prog_load() API revamp. The biggest different and complexity is
in changing btf_dump__new(), because function overloading through macro magic
doesn't work based on number of arguments, as both new and old APIs have
4 arguments. Because of that, another overloading approach is taken; overload
happens based on argument types.
I've validated manually (by using local test_progs-shared flavor that is
compiling test_progs against libbpf as a shared library) that compiling "old
application" (selftests before being adapted to using new variants of revamped
APIs) are compiled and successfully run against newest libbpf version as well
as the older libbpf version (provided no new variants are used). All these
scenarios seem to be working as expected.
v1->v2:
- add explicit printf_fn NULL check in btf_dump__new() (Alexei);
- replaced + with || in __builtin_choose_expr() (Alexei);
- dropped test_progs-shared flavor (Alexei).
====================
Andrii Nakryiko [Thu, 11 Nov 2021 05:36:24 +0000 (21:36 -0800)]
bpftool: Update btf_dump__new() and perf_buffer__new_raw() calls
Use v1.0-compatible variants of btf_dump and perf_buffer "constructors".
This is also a demonstration of reusing struct perf_buffer_raw_opts as
OPTS-style option struct for new perf_buffer__new_raw() API.
Andrii Nakryiko [Thu, 11 Nov 2021 05:36:20 +0000 (21:36 -0800)]
libbpf: Make perf_buffer__new() use OPTS-based interface
Add new variants of perf_buffer__new() and perf_buffer__new_raw() that
use OPTS-based options for future extensibility ([0]). Given all the
currently used API names are best fits, re-use them and use
___libbpf_override() approach and symbol versioning to preserve ABI and
source code compatibility. struct perf_buffer_opts and struct
perf_buffer_raw_opts are kept as well, but they are restructured such
that they are OPTS-based when used with new APIs. For struct
perf_buffer_raw_opts we keep few fields intact, so we have to also
preserve the memory location of them both when used as OPTS and for
legacy API variants. This is achieved with anonymous padding for OPTS
"incarnation" of the struct. These pads can be eventually used for new
options.
Andrii Nakryiko [Thu, 11 Nov 2021 05:36:19 +0000 (21:36 -0800)]
libbpf: Ensure btf_dump__new() and btf_dump_opts are future-proof
Change btf_dump__new() and corresponding struct btf_dump_ops structure
to be extensible by using OPTS "framework" ([0]). Given we don't change
the names, we use a similar approach as with bpf_prog_load(), but this
time we ended up with two APIs with the same name and same number of
arguments, so overloading based on number of arguments with
___libbpf_override() doesn't work.
Instead, use "overloading" based on types. In this particular case,
print callback has to be specified, so we detect which argument is
a callback. If it's 4th (last) argument, old implementation of API is
used by user code. If not, it must be 2nd, and thus new implementation
is selected. The rest is handled by the same symbol versioning approach.
btf_ext argument is dropped as it was never used and isn't necessary
either. If in the future we'll need btf_ext, that will be added into
OPTS-based struct btf_dump_opts.
struct btf_dump_opts is reused for both old API and new APIs. ctx field
is marked deprecated in v0.7+ and it's put at the same memory location
as OPTS's sz field. Any user of new-style btf_dump__new() will have to
set sz field and doesn't/shouldn't use ctx, as ctx is now passed along
the callback as mandatory input argument, following the other APIs in
libbpf that accept callbacks consistently.
Again, this is quite ugly in implementation, but is done in the name of
backwards compatibility and uniform and extensible future APIs (at the
same time, sigh). And it will be gone in libbpf 1.0.
Andrii Nakryiko [Thu, 11 Nov 2021 05:36:18 +0000 (21:36 -0800)]
libbpf: Turn btf_dedup_opts into OPTS-based struct
btf__dedup() and struct btf_dedup_opts were added before we figured out
OPTS mechanism. As such, btf_dedup_opts is non-extensible without
breaking an ABI and potentially crashing user application.
Unfortunately, btf__dedup() and btf_dedup_opts are short and succinct
names that would be great to preserve and use going forward. So we use
___libbpf_override() macro approach, used previously for bpf_prog_load()
API, to define a new btf__dedup() variant that accepts only struct btf *
and struct btf_dedup_opts * arguments, and rename the old btf__dedup()
implementation into btf__dedup_deprecated(). This keeps both source and
binary compatibility with old and new applications.
The biggest problem was struct btf_dedup_opts, which wasn't OPTS-based,
and as such doesn't have `size_t sz;` as a first field. But btf__dedup()
is a pretty rarely used API and I believe that the only currently known
users (besides selftests) are libbpf's own bpf_linker and pahole.
Neither use case actually uses options and just passes NULL. So instead
of doing extra hacks, just rewrite struct btf_dedup_opts into OPTS-based
one, move btf_ext argument into those opts (only bpf_linker needs to
dedup btf_ext, so it's not a typical thing to specify), and drop never
used `dont_resolve_fwds` option (it was never used anywhere, AFAIK, it
makes BTF dedup much less useful and efficient).
Just in case, for old implementation, btf__dedup_deprecated(), detect
non-NULL options and error out with helpful message, to help users
migrate, if there are any user playing with btf__dedup().
The last remaining piece is dedup_table_size, which is another
anachronism from very early days of BTF dedup. Since then it has been
reduced to the only valid value, 1, to request forced hash collisions.
This is only used during testing. So instead introduce a bool flag to
force collisions explicitly.
This patch also adapts selftests to new btf__dedup() and btf_dedup_opts
use to avoid selftests breakage.