Steve Wise [Thu, 1 Mar 2018 21:57:29 +0000 (13:57 -0800)]
RDMA/nldev: common resource dumpit function
Create a common dumpit function that can be used by all common resource
types. This reduces code replication and simplifies the code as we add
more resource types.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Wed, 7 Mar 2018 20:40:29 +0000 (15:40 -0500)]
Merge tag 'mlx5-updates-2018-02-28-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-1 (IPSec-1)
This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.
We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.
In order to achieve this, we need that the commands will get all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.
Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.
Zhu Yanjun [Tue, 27 Feb 2018 11:04:32 +0000 (06:04 -0500)]
IB/rxe: remove unnecessary skb_clone
In send_atomic_ack function, it is not necessary to make a
skb_clone. To gain better performance (high throughput and
low latency), this skb_clone is removed.
The following tests are made.
server client
--------- ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
--------- ---------
On server: rping -s -a 1.1.1.1 -v -C 1000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1000 -S 512
The kernel config CONFIG_DEBUG_KMEMLEAK is enabled on both server
and client.
This test runs for several hours. There is no memory leak and the whole
system can work well.
Based on the above network, the following tests are made.
Bart Van Assche [Fri, 2 Mar 2018 21:14:15 +0000 (13:14 -0800)]
IB/srpt: Add RDMA/CM support
Add a parameter for configuring the port on which the ib_srpt driver
listens for incoming RDMA/CM connections, namely
/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port. The default
value for this parameter is 0 which means "do not listen for incoming
RDMA/CM connections". Add RDMA/CM support to all code that handles
connection state changes. Modify srpt_init_nodeacl() such that ACLs can
be configured for IPv4 and IPv6 addresses.
Note: incoming connection requests are only accepted for ports that
have been enabled. See also the "if (!sport->enabled)" code in the
connection request handler. See also the following configfs attribute:
/sys/kernel/config/target/srpt/$port/$port/enable.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Aviad Yehezkel [Sun, 18 Feb 2018 13:00:54 +0000 (15:00 +0200)]
net/mlx5: Flow steering cmd interface should get the fte when deleting
Previously, deleting a flow steering entry only got the index.
Since the FPGA implementation of FTE's deletion might need to dig
inside the FTE itself, we would like to get the FTE's context.
Changing the interface to pass the FTE context.
Aviad Yehezkel [Sun, 18 Feb 2018 11:17:17 +0000 (13:17 +0200)]
net/mlx5: Add empty egress namespace to flow steering core
Currently, we don't support egress flow steering namespace in mlx5
flow steering core implementation. However, when we want to encrypt
a packet, we model it as a flow steering rule in the egress path.
To overcome this, we add an empty egress namespace to flow steering.
This namespace is initialized only when ipsec support exists.
In the future, this will grow to a full blown full steering
implementation, resembling the ingress path.
Matan Barak [Sun, 20 Aug 2017 12:46:51 +0000 (15:46 +0300)]
net/mlx5: Add shim layer between fs and cmd
The shim layer allows each namespace to define possibly different
functionality for add/delete/update commands. The shim layer
introduced here, will be used to support flow steering with the FPGA.
Matan Barak [Wed, 16 Aug 2017 06:43:48 +0000 (09:43 +0300)]
{net,IB}/mlx5: Add has_tag to mlx5_flow_act
The has_tag member will indicate whether a tag action was specified
in flow specification.
A flow tag 0 = MLX5_FS_DEFAULT_FLOW_TAG is assumed a valid flow tag
that is currently used by mlx5 RDMA driver, whereas in HW flow_tag = 0
means that the user doesn't care about flow_tag. HW always provide
a flow_tag = 0 if all flow tags requested on a specific flow are 0.
So we need a way (in the driver) to differentiate between a user really
requesting flow_tag = 0 and a user who does not care, in order to be
able to report conflicting flow tags on a specific flow.
Boris Pismenny [Wed, 16 Aug 2017 06:33:30 +0000 (09:33 +0300)]
IB/mlx5: Pass mlx5_flow_act struct instead of multiple arguments
Group and pass all function arguments of parse_flow_attr call in one
common struct mlx5_flow_act.
This patch passes all the action arguments of parse_flow_attr in one common
struct mlx5_flow_act. It allows us to scale the number of actions without adding
new arguments to the function.
Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Sun, 19 Nov 2017 15:51:13 +0000 (15:51 +0000)]
net/mlx5: FPGA and IPSec initialization to be before flow steering
Some flow steering namespace initialization (i.e. egress namespace)
might depend on FPGA capabilities. Changing the initialization order
such that the FPGA will be initialized before flow steering.
Flow steering fs cmds initialization might depend on
IPSec capabilities. Changing the initialization order such
that the IPSec will be initialized before flow steering as well.
RDMA/bnxt_re/qplib_sp: Use true and false for boolean values
Assign true or false to boolean variables instead of an integer value.
This issue was detected with the help of Coccinelle.
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Acked-by: Selvin Xavier <selvin.xavier@broadcom.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Sergey Gorenko [Mon, 5 Mar 2018 18:15:56 +0000 (20:15 +0200)]
IB/srp: Use the IB_DEVICE_SG_GAPS_REG HCA feature if supported
If a HCA supports the SG_GAPS_REG feature then fewer memory regions
are required per command. This patch reduces the number of memory
regions that is allocated per SRP session.
Signed-off-by: Sergey Gorenko <sergeygo@mellanox.com> Reviewed-by: Max Gurtovoy <maxg@mellanox.com> Tested-by: Laurence Oberman <loberman@redhat.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Acked-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Mon, 5 Mar 2018 17:58:33 +0000 (09:58 -0800)]
IB/hfi1: Add a missing rcu_read_unlock()
This patch avoids that sparse reports the following:
drivers/infiniband/hw/hfi1/driver.c:251:13: warning: context imbalance in 'rcv_hdrerr' - different lock contexts for basic block
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Shiraz Saleem [Fri, 2 Mar 2018 21:17:14 +0000 (15:17 -0600)]
i40iw: Implement get_vector_affinity API
Storage ULPs (like NVMEoF) benefit from exposing affinity mapping
per completion vector to find the optimal multi-queue affinity
assignments. The ULPs call the verbs API ib_get_vector_affinity
introduced in commit c66cd353bbe ("RDMA/core: expose affinity mappings per
completion vector") to get the underlying devices affinity mappings.
Add support in driver to expose the affinity masks per MSI-X
completion vector.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Shiraz Saleem [Fri, 2 Mar 2018 21:17:13 +0000 (15:17 -0600)]
i40iw: Improve CM node lookup time on connection setup
Currently all CM nodes involved in a connection are
maintained in a connected_node list per dev. During
connection setup, we need to search this every time
we receive a packet on the iWARP LAN Queue (ILQ) and
this can be pretty inefficient for large number of
connections.
Fix this by organizing the CM nodes in two lists -
accelerated list and non-accelerated list. The search
on ILQ receive would be limited to only non accelerated
nodes. When a node moves to RTS, it is added to the
accelerated list.
Benchmarking ucmatose 16k connections shows a 20%
improvement in test completion time.
Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Thu, 1 Mar 2018 22:00:29 +0000 (14:00 -0800)]
RDMA/rxe: Fix an out-of-bounds read
This patch avoids that KASAN reports the following when the SRP initiator
calls srp_post_send():
==================================================================
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x5c4/0x980 [rdma_rxe]
Read of size 8 at addr ffff880066606e30 by task 02-mq/1074
Fixes: 8700e3e7c485 ("Soft RoCE driver") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Moni Shoua <monis@mellanox.com> Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Thu, 1 Mar 2018 22:00:28 +0000 (14:00 -0800)]
RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access
This patch fixes the following KASAN complaint:
==================================================================
BUG: KASAN: stack-out-of-bounds in rxe_post_send+0x77d/0x9b0 [rdma_rxe]
Read of size 8 at addr ffff880061aef860 by task 01/1080
Fixes: 765d67748bcf ("IB: new common API for draining queues") Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Colin Ian King [Thu, 1 Mar 2018 16:23:54 +0000 (16:23 +0000)]
infiniband: remove redundant assignment to pointer 'rdi'
The pointer rdi is being initialized with a value that is never read
and re-assigned immediately after, hence the initialization is redundant
and can be removed.
Cleans up clang warning:
drivers/infiniband/sw/rdmavt/vt.c:94:23: warning: Value stored to 'rdi'
during its initialization is never read
Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Hernán Gonzalez [Tue, 27 Feb 2018 22:05:43 +0000 (19:05 -0300)]
IB/qib: Move char *qib_sdma_state_names[] and constify while there.
Note: This is compile only tested as I have no access to the hw.
This variable was not used in qib_sdma.c but in qib_iba7322.c. Declaring it
there, as static, saves 56 bytes.
add/remove: 0/2 grow/shrink: 0/0 up/down: 0/-144 (-144)
Function old new delta
qib_sdma_state_names 56 - -56
qib_sdma_event_names 88 - -88
Total: Before=2874565, After=2874421, chg -0.01%
Signed-off-by: Hernán Gonzalez <hernan@vanguardiasur.com.ar> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 23 Feb 2018 22:09:26 +0000 (14:09 -0800)]
IB/srp: Use %pIS instead of inet_ntop()
Except for a minor log message change, this patch does not change
any functionality. For the introduction of %pIS, see also commit 1067964305df ("lib: vsprintf: add IPv4/v6 generic %p[Ii]S[pfs]
format specifier").
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 23 Feb 2018 22:09:25 +0000 (14:09 -0800)]
Revert "IB/srp: Avoid that a cable pull can trigger a kernel crash"
The caller of srp_ib_lookup_path() is responsible for holding a reference
on the SCSI host. That means that commit 8a0d18c62121 was not necessary.
Hence revert it.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Bart Van Assche [Fri, 23 Feb 2018 22:09:24 +0000 (14:09 -0800)]
IB/srp: Fix srp_abort()
Before commit e494f6a72839 ("[SCSI] improved eh timeout handler") it
did not really matter whether or not abort handlers like srp_abort()
called .scsi_done() when returning another value than SUCCESS. Since
that commit however this matters. Hence only call .scsi_done() when
returning SUCCESS.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Arnd Bergmann [Tue, 20 Feb 2018 20:56:27 +0000 (21:56 +0100)]
infiniband: bnxt_re: use BIT_ULL() for 64-bit bit masks
On 32-bit targets, we otherwise get a warning about an impossible constant
integer expression:
In file included from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/infiniband/hw/bnxt_re/ib_verbs.c:39:
drivers/infiniband/hw/bnxt_re/ib_verbs.c: In function 'bnxt_re_query_device':
include/linux/bitops.h:7:24: error: left shift count >= width of type [-Werror=shift-count-overflow]
#define BIT(nr) (1UL << (nr))
^~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:61:34: note: in expansion of macro 'BIT'
#define BNXT_RE_MAX_MR_SIZE_HIGH BIT(39)
^~~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:62:30: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE_HIGH'
#define BNXT_RE_MAX_MR_SIZE BNXT_RE_MAX_MR_SIZE_HIGH
^~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/ib_verbs.c:149:25: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE'
ib_attr->max_mr_size = BNXT_RE_MAX_MR_SIZE;
^~~~~~~~~~~~~~~~~~~
Fixes: 872f3578241d ("RDMA/bnxt_re: Add support for MRs with Huge pages") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Arnd Bergmann [Tue, 20 Feb 2018 20:56:26 +0000 (21:56 +0100)]
infiniband: qplib_fp: fix pointer cast
Building for a 32-bit target results in a couple of warnings from casting
between a 32-bit pointer and a 64-bit integer:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_service_nq':
drivers/infiniband/hw/bnxt_re/qplib_fp.c:333:23: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
bnxt_qplib_arm_srq((struct bnxt_qplib_srq *)q_handle,
^
drivers/infiniband/hw/bnxt_re/qplib_fp.c:336:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
(struct bnxt_qplib_srq *)q_handle,
^
In file included from include/linux/byteorder/little_endian.h:5,
from arch/arm/include/uapi/asm/byteorder.h:22,
from include/asm-generic/bitops/le.h:6,
from arch/arm/include/asm/bitops.h:342,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/infiniband/hw/bnxt_re/qplib_fp.c:39:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_create_srq':
include/uapi/linux/byteorder/little_endian.h:31:43: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
#define __cpu_to_le64(x) ((__force __le64)(__u64)(x))
^
include/linux/byteorder/generic.h:86:21: note: in expansion of macro '__cpu_to_le64'
#define cpu_to_le64 __cpu_to_le64
^~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/qplib_fp.c:569:19: note: in expansion of macro 'cpu_to_le64'
req.srq_handle = cpu_to_le64(srq);
Using a uintptr_t as an intermediate works on all architectures.
Fixes: 37cb11acf1f7 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:39 +0000 (12:18 +0200)]
IB/uverbs: Tidy uverbs_uobject_add
Maintaining the uobjects list is mandatory, hoist it into the common
rdma_alloc_commit_uobject() function and inline it as there is now
only one caller.
Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Doug Ledford [Wed, 28 Feb 2018 18:37:39 +0000 (13:37 -0500)]
Merge tag 'mlx5-updates-2018-02-23' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-update-2018-02-23 (IB representors)
From: Mark Bloch <markb@mellanox.com>
=========
Add IB representor when in switchdev mode
The following series adds support for an IB (RAW Ethernet only) device
representor which is created when the user switches to switchdev mode.
Today when switching to switchdev mode the only representors which are
created are net devices. Each netdev is a representor of a virtual
function and any data sent via the representor is received on the virtual
function, and any data sent via the virtual function is received by the
representor.
For the mlx5 driver the main use of this functionality is to be able to
use Open vSwitch on the hypervisor in order to manage/control traffic
from/to the virtual functions. Open vSwitch can also work with DPDK
devices and not just net devices, this series exposes an IB device, which
Mellanox PMD driver uses, which then can be used by Open vSwitch DPDK.
An IB device representor exposes only RAW Ethernet QP capabilities and
the ability to create flow rules to direct traffic to its RX queues. The
state of the IB device (ACTIVE/DOWN etc..) is based on the state of the
corresponding net device representor. No other RDMA/RoCE functionality is
currently supported and no GID table is exposed.
=========
Mark Bloch [Mon, 15 Jan 2018 13:11:37 +0000 (13:11 +0000)]
IB/mlx5: Disable self loopback check when in switchdev mode
When in switchdev mode, there is no need to do self loopback checks
as we can't receive those packets, we insert steering rules to the
eswitch that make sure packets can't be looped back.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 23 Jan 2018 11:24:13 +0000 (11:24 +0000)]
net/mlx5: E-Switch, Reload IB interface when switching devlink modes
Up until this point it wasn't possible to activate IB representors
when switching to switchdev mode, remove this limitation.
We trigger reload of the PF IB interface in order to make sure that
already allocated resources are invalid and new resources will be opened
correctly with all the limitations of switchdev mode applied (only raw
packet capabilities, without RoCE). We also move the remove/add to a
place where the E-Switch mode is set/unset to better control when to
trigger this action, this will allow the IB side to start in the correct
mode.
For better code reuse, create a function which reloads an interface and
export it.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 23 Jan 2018 11:16:30 +0000 (11:16 +0000)]
IB/mlx5: Add proper representors support
This commit adds full support for IB representor:
1) Representors profile, We add two new profiles:
nic_rep_profile - This profile will be used to create an IB device that
represents the PF/UPLINK.
rep_profile - This profile will be used to create an IB device that
represents VFs. Each VF will be its own representor.
2) Proper load/unload callbacks, Those are called by the E-Switch when
moving to/from switchdev mode.
3) Different flow DB handling for when we in switchdev mode.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Mon, 29 Jan 2018 10:40:37 +0000 (10:40 +0000)]
IB/mlx5: E-Switch, Add rule to forward traffic to vport
In order to forward traffic from representor's SQ to the right virtual
function, every time an SQ is created also add the corresponding flow rule
to the FDB.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Mon, 22 Jan 2018 15:29:44 +0000 (15:29 +0000)]
IB/mlx5: Don't expose MR cache in switchdev mode
When enabling many VFs and switching to switchdev mode, the total amount
of mkeys we try to allocate when loading representors is very large and
may cause timeouts on allocations, the same issues was observed on VFs
and we employ the same fix that was done for them. We avoid allocating
the full MR cache on load but still allow it to be manipulated once the
IB device is loaded.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Mon, 6 Nov 2017 12:22:13 +0000 (12:22 +0000)]
IB/mlx5: When in switchdev mode, expose only raw packet capabilities
Currently in switchdev mode we allow only for raw packet QPs.
Expose the right capabilities and set the gid table length to 0, also
make sure we don't try to enable RoCE, so split the function
to enable RoCE so representors can enable only the notifier needed for
net device events.
Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 15:02:36 +0000 (15:02 +0000)]
IB/mlx5: Listen to netdev register/unresiter events in switchdev mode
Currently we listen to netdev register/unregister event based on PCI
device. When in switchdev mode PF and representors share the same PCI
device, so in order to pair ib device and netdev in switchdev mode
compare the netdev that triggered the event to that of the representor.
Expose a function that lets you receive the netdev associated what
a given representor.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:44:29 +0000 (14:44 +0000)]
IB/mlx5: Add match on vport when in switchdev mode
When we point to a representor, it means we are in switchdev mode.
The flow db is shared between PF and virtual function representors
so each rule created needs to have a match on its specific source port.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:42:35 +0000 (14:42 +0000)]
IB/mlx5: Allocate flow DB only on PF IB device
A flow DB is a shared resource between PF and representors,
need to allocate it only when creating the PF IB device.
Once we add IB representors, they will use the flow db which was
created by the PF.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Create the basic infrastructure of registering and unregistering
IB representors. The load/unload callbacks are left empty and
proper implementation will be introduced in following patches.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:13:46 +0000 (14:13 +0000)]
net/mlx5: E-Switch, Add definition of IB representor
Create a new representor type: REP_IB. which will be initialized by an IB
device that is used as a logical representor of a eswitch vport (VF or
uplink) just like we have a net device today in switchdev mode.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 30 Jan 2018 10:46:34 +0000 (10:46 +0000)]
net/mlx5: E-Switch, Optimize HW steering tables in switchdev mode
Under switchdev mode we insert an eswitch miss rule causing any
unmatched traffic to be sent towards the PF vport. This miss rule can
be optimized if we break it to two, one case is for multicast traffic and
the other for unicast.
Breaking the miss rule into two (unicast and multicast) allows the firmware
to program the hardware in a more efficient way.
Using ConncetX-5 Ex with IXIA and testpmd (which use IB representors):
IXIA -> NIC -> PF -> IB representor -> NIC -> VF:
- Without this optimization: 9.2 MPPS.
- With this optimization: 18 MPPS.
VF -> NIC -> IB representor-> PF -> NIC -> IXIA:
- Without this optimization: 17 MPPS.
- With this optimization: 23.4 MPPS.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 23 Jan 2018 11:19:12 +0000 (11:19 +0000)]
net/mlx5: E-Switch, Increase number of FTEs in FDB in switchdev mode
The max FTE number should be the max number of SQs that can be opened.
Ethernet representors open one SQ each. Once we add IB representor this
will increase (depends on the user). For now lets start with 31
per IB representor and if needed increase in the future.
This increase only affects the number of FTEs in the slow path FDB,
offloaded rules (done via TC on the fast path portion of the FDB)
aren't affected.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Mark Bloch [Tue, 16 Jan 2018 14:04:14 +0000 (14:04 +0000)]
net/mlx5: E-Switch, Move representors definition to a global scope
In preparation for IB representors, move representors structs to a global
scope, also expose functions needed for registration, unregistration,
eswitch mode and creating a flow rule to direct traffic from SQs to the
right VF.
Signed-off-by: Mark Bloch <markb@mellanox.com> Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:42 +0000 (18:12 +0200)]
RDMA/uverbs: Replace user's types with kernel's types
The internal to kernel variable declarations don't need to be
declared with user types. This patch converts such occurrences
appeared in ib_uverbs_write().
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
and IB_USER_VERBS_CMD_CREATE_CQ <= IB_USER_VERBS_CMD_OPEN_QP,
so if we execute IB_USER_VERBS_EX_CMD_CREATE_CQ this code checks
ib_dev->uverbs_cmd_mask not ib_dev->uverbs_ex_cmd_mask.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:36 +0000 (18:12 +0200)]
RDMA/uverbs: Unify return values of not supported command
The non-existing command is supposed to return -EOPNOTSUPP, but the
current code returns different errors for different flows for the
same failure. This patch unifies those flows.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 21 Feb 2018 16:12:33 +0000 (18:12 +0200)]
RDMA/uverbs: Refactor flags checks and update return value
Since commit f21519b23c1b ("IB/core: extended command: an
improved infrastructure for uverbs commands"), the uverbs
supports extra flags as an input to the command interface.
However actually, there is only one flag available and used,
so it is better to refactor the code, so the resolution and
report to the users is done as early as possible.
As part of this change, we changed the return value of failure case
from ENOSYS to be EINVAL to be consistent with the rest flags checks.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Fri, 23 Feb 2018 03:27:20 +0000 (22:27 -0500)]
Merge branch 'k.o/for-rc' into k.o/wip/dl-for-next
There is a 14 patch series waiting to come into for-next that has a
dependecy on code submitted into this kernel's for-rc series. So, merge
the for-rc branch into the current for-next in order to make the patch
series apply cleanly.
Doug Ledford [Fri, 23 Feb 2018 01:52:28 +0000 (20:52 -0500)]
Merge tag 'mlx5-updates-2018-02-21' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-21
This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.
By Saeed,
First six patches of the series are meant to address a performance issue
and should provide a performance boost for multi core IRQ interrupt hungry
workloads. The issue is fixed in the first patch, all other patches are
meant to refactor the code in light of this fix.
The problem it comes to fix, is a shared spinlock accessed across all HCA
IRQs which protects the CQ database. To solve this we simply move the CQ
database and its spinlock to be per EQ (IRQ), thus per core.
By Yonatan,
Fragmented completion queue (CQ) for RDMA,
core driver implementation to create fragmented CQ buffers rather than
one large contiguous memory buffer, the implementation scheme already
exist and used by the netdev CQs, the patch shares that code with the
rdma CQ creation flow and makes use of the new API in mlx5_ib driver.
Thanks,
Saeed.
Merged into rdma-next tree as well as net-next tree to prevent conflicts
in future patches between the two trees.
Leon Romanovsky [Wed, 21 Feb 2018 08:25:01 +0000 (10:25 +0200)]
RDMA/uverbs: Fix kernel panic while using XRC_TGT QP type
Attempt to modify XRC_TGT QP type from the user space (ibv_xsrq_pingpong
invocation) will trigger the following kernel panic. It is caused by the
fact that such QPs missed uobject initialization.
Selvin Xavier [Fri, 16 Feb 2018 05:20:13 +0000 (21:20 -0800)]
RDMA/bnxt_re: Avoid system hang during device un-reg
BNXT_RE_FLAG_TASK_IN_PROG doesn't handle multiple work
requests posted together. Track schedule of multiple
workqueue items by maintaining a per device counter
and proceed with IB dereg only if this counter is zero.
flush_workqueue is no longer required from
NETDEV_UNREGISTER path.
Selvin Xavier [Fri, 16 Feb 2018 05:20:12 +0000 (21:20 -0800)]
RDMA/bnxt_re: Fix system crash during load/unload
During driver unload, the driver proceeds with cleanup
without waiting for the scheduled events. So the device
pointers get freed up and driver crashes when the events
are scheduled later.
Flush the bnxt_re_task work queue before starting
device removal.
Steve Wise [Thu, 15 Feb 2018 02:43:36 +0000 (18:43 -0800)]
RDMA/restrack: don't use uaccess_kernel()
uaccess_kernel() isn't sufficient to determine if an rdma resource is
user-mode or not. For example, resources allocated in the add_one()
function of an ib_client get falsely labeled as user mode, when they
are kernel mode allocations. EG: mad qps.
The result is that these qps are skipped over during a nldev query
because of an erroneous namespace mismatch.
So now we determine if the resource is user-mode by looking at the object
struct's uobject or similar pointer to know if it was allocated for user
mode applications.
Fixes: 02d8883f520e ("RDMA/restrack: Add general infrastructure to track RDMA resources") Signed-off-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Ensure that cv_end is equal to ibdev->num_comp_vectors for the
NUMA node with the highest index. This patch improves spreading
of RDMA channels over completion vectors and thereby improves
performance, especially on systems with only a single NUMA node.
This patch drops support for the comp_vector login parameter by
ignoring the value of that parameter since I have not found a
good way to combine support for that parameter and automatic
spreading of RDMA channels over completion vectors.
Fixes: d92c0da71a35 ("IB/srp: Add multichannel support") Reported-by: Alexander Schmid <alex@modula-shop-systems.de> Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Cc: Alexander Schmid <alex@modula-shop-systems.de> Cc: stable@vger.kernel.org Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Adit Ranadive [Thu, 15 Feb 2018 20:36:46 +0000 (12:36 -0800)]
RDMA/vmw_pvrdma: Fix usage of user response structures in ABI file
This ensures that we return the right structures back to userspace.
Otherwise, it looks like the reserved fields in the response structures
in userspace might have uninitialized data in them.
Leon Romanovsky [Wed, 14 Feb 2018 10:35:40 +0000 (12:35 +0200)]
RDMA/uverbs: Sanitize user entered port numbers prior to access it
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs+0x6f2/0x8c0
Read of size 4 at addr ffff88006476a198 by task syzkaller697701/265
Cc: syzkaller <syzkaller@googlegroups.com> Cc: <stable@vger.kernel.org> # 4.11 Fixes: fd3c7904db6e ("IB/core: Change idr objects to use the new schema") Reported-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 10:35:38 +0000 (12:35 +0200)]
RDMA/uverbs: Fix bad unlock balance in ib_uverbs_close_xrcd
There is no matching lock for this mutex. Git history suggests this is
just a missed remnant from an earlier version of the function before
this locking was moved into uverbs_free_xrcd.
Originally this lock was protecting the xrcd_table_delete()
=====================================
WARNING: bad unlock balance detected!
4.15.0+ #87 Not tainted
-------------------------------------
syzkaller223405/269 is trying to release lock (&uverbs_dev->xrcd_tree_mutex) at:
[<00000000b8703372>] ib_uverbs_close_xrcd+0x195/0x1f0
but there are no more locks to release!
other info that might help us debug this:
1 lock held by syzkaller223405/269:
#0: (&uverbs_dev->disassociate_srcu){....}, at: [<000000005af3b960>] ib_uverbs_write+0x265/0xef0
Cc: syzkaller <syzkaller@googlegroups.com> Cc: <stable@vger.kernel.org> # 4.11 Fixes: fd3c7904db6e ("IB/core: Change idr objects to use the new schema") Reported-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Wed, 14 Feb 2018 10:35:37 +0000 (12:35 +0200)]
RDMA/restrack: Increment CQ restrack object before committing
Once the uobj is committed it is immediately possible another thread
could destroy it, which worst case, can result in a use-after-free
of the restrack objects.
Cc: syzkaller <syzkaller@googlegroups.com> Fixes: 08f294a1524b ("RDMA/core: Add resource tracking for create and destroy CQs") Reported-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Tue, 13 Feb 2018 10:18:41 +0000 (12:18 +0200)]
RDMA/uverbs: Protect from command mask overflow
The command number is not bounds checked against the command mask before it
is shifted, resulting in an ubsan hit. This does not cause malfunction since
the command number is eventually bounds checked, but we can make this ubsan
clean by moving the bounds check to before the mask check.
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:40 +0000 (12:18 +0200)]
IB/uverbs: Fix unbalanced unlock on error path for rdma_explicit_destroy
If remove_commit fails then the lock is left locked while the uobj still
exists. Eventually the kernel will deadlock.
lockdep detects this and says:
test/4221 is leaving the kernel with locks still held!
1 lock held by test/4221:
#0: (&ucontext->cleanup_rwsem){.+.+}, at: [<000000001e5c7523>] rdma_explicit_destroy+0x37/0x120 [ib_uverbs]
Fixes: 4da70da23e9b ("IB/core: Explicitly destroy an object while keeping uobject") Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:38 +0000 (12:18 +0200)]
IB/uverbs: Improve lockdep_check
This is really being used as an assert that the expected usecnt
is being held and implicitly that the usecnt is valid. Rename it to
assert_uverbs_usecnt and tighten the checks to only accept valid
values of usecnt (eg 0 and < -1 are invalid).
The tigher checkes make the assertion cover more cases and is more
likely to find bugs via syzkaller/etc.
Fixes: 3832125624b7 ("IB/core: Add support for idr types") Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Tue, 13 Feb 2018 10:18:37 +0000 (12:18 +0200)]
RDMA/uverbs: Protect from races between lookup and destroy of uobjects
The race is between lookup_get_idr_uobject and
uverbs_idr_remove_uobj -> uverbs_uobject_put.
We deliberately do not call sychronize_rcu after the idr_remove in
uverbs_idr_remove_uobj for performance reasons, instead we call
kfree_rcu() during uverbs_uobject_put.
However, this means we can obtain pointers to uobj's that have
already been released and must protect against krefing them
using kref_get_unless_zero.
==================================================================
BUG: KASAN: use-after-free in copy_ah_attr_from_uverbs.isra.2+0x860/0xa00
Read of size 4 at addr ffff88005fda1ac8 by task syz-executor2/441
Memory state around the buggy address: ffff88005fda1980: fc fc fb fb fb fb fc fc fb fb fb fb fc fc fb fb ffff88005fda1a00: fb fb fc fc fb fb fb fb fc fc 00 00 00 00 fc fc ffff88005fda1a80: fb fb fb fb fc fc fb fb fb fb fc fc fb fb fb fb ffff88005fda1b00: fc fc 00 00 00 00 fc fc fb fb fb fb fc fc fb fb ffff88005fda1b80: fb fb fc fc fb fb fb fb fc fc fb fb fb fb fc fc
==================================================================@
Cc: syzkaller <syzkaller@googlegroups.com> Cc: <stable@vger.kernel.org> # 4.11 Fixes: 3832125624b7 ("IB/core: Add support for idr types") Reported-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:36 +0000 (12:18 +0200)]
IB/uverbs: Hold the uobj write lock after allocate
This clarifies the design intention that time between allocate and
commit has the uobj exclusive to the caller. We already guarantee
this by delaying publishing the uobj pointer via idr_insert,
fd_install, list_add, etc.
Additionally holding the usecnt lock during this period provides
extra clarity and more protection against future mistakes.
Fixes: 3832125624b7 ("IB/core: Add support for idr types") Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Tue, 13 Feb 2018 10:18:35 +0000 (12:18 +0200)]
IB/uverbs: Fix possible oops with duplicate ioctl attributes
If the same attribute is listed twice by the user in the ioctl attribute
list then error unwind can cause the kernel to deref garbage.
This happens when an object with WRITE access is sent twice. The second
parse properly fails but corrupts the state required for the error unwind
it triggers.
Fixing this by making duplicates in the attribute list invalid. This is
not something we need to support.
The ioctl interface is currently recommended to be disabled in kConfig.
Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Matan Barak [Tue, 13 Feb 2018 10:18:34 +0000 (12:18 +0200)]
IB/uverbs: Add ioctl support for 32bit processes
32 bit processes running on a 64 bit kernel call compat_ioctl so that
implementations can revise any structure layout issues. Point compat_ioctl
at our normal ioctl because:
- All our structures are designed to be the same on 32 and 64 bit, ie we
use __aligned_u64 when required and are careful to manage padding.
- Any pointers are stored in u64's and userspace is expected
to prepare them properly.
Signed-off-by: Matan Barak <matanb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Feb 2018 10:18:33 +0000 (12:18 +0200)]
IB/uverbs: Use __aligned_u64 for uapi headers
This has no impact on the structure layout since these structs already
have their u64s already properly aligned, but it does document that we
have this requirement for 32 bit compatibility.
Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>