Jason Gunthorpe [Fri, 16 Mar 2018 03:18:14 +0000 (21:18 -0600)]
RDMA/bnxt: Fix structure layout for bnxt_re_pd_resp
What is going on here is a bit subtle, in the kernel there is no
problem because the struct is copied using copy_from_user, so it
can safely have an 8 byte alignment, however in userspace it must
be constructed by concatenation with the ib_uverbs_alloc_pd_resp
struct. This is due to the required memory layout to execute the
command.
Since ibv_uverbs_alloc_pd_resp is only 4 bytes long, this causes
misalignment, and the user space will experience an unexpected padding.
Currently it works around this via pointer maths.
Make everything more robust by having the compiler reduce the alignment
of the struct to 4. The userspace has assertions to ensure this
works properly in all situations.
Honggang Li [Fri, 16 Mar 2018 02:37:13 +0000 (10:37 +0800)]
IB/mlx5: Set the default active rate and width to QDR and 4X
Before commit f1b65df5a232 ("IB/mlx5: Add support for active_width and
active_speed in RoCE"), the mlx5_ib driver set the default active_width
and active_speed to IB_WIDTH_4X and IB_SPEED_QDR.
When the RoCE port is down, the RoCE port does not negotiate the active
width with the remote side, causing the active width to be zero. When
running userspace ibstat to view the port status, ibstat will panic as it
reads an invalid width from sys file.
This patch restores the original behavior.
Fixes: f1b65df5a232 ("IB/mlx5: Add support for active_width and active_speed in RoCE"). Signed-off-by: Honggang Li <honli@redhat.com> Reviewed-by: Hal Rosenstock <hal@mellanox.com> Reviewed-by: Noa Osherovich <noaos@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Honggang Li [Thu, 15 Mar 2018 09:02:13 +0000 (17:02 +0800)]
IB/core: Set speed string to SDR for invalid active rates
Before commit f1b65df5a232 ("IB/mlx5: Add support for active_width and
active_speed in RoCE"), the mlx5_ib driver set default active_width and
active_speed to IB_WIDTH_4X and IB_SPEED_QDR.
Now, the active_width and active_speed are zeros if the RoCE port
is in DOWN state. The speed string should be set to " SDR" instead of
a blank string when active_speed is zero.
Signed-off-by: Honggang Li <honli@redhat.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Leon Romanovsky [Thu, 15 Mar 2018 09:10:42 +0000 (11:10 +0200)]
RDMA/restrack: Don't rely on uninitialized variable in restrack_add flow
The restrack code relies on the fact that object structures are zeroed at
the allocation stage, the mlx4 CQ wasn't allocated with kzalloc and it
caused to the following crash.
Fixes: 00313983cda6 ("RDMA/nldev: provide detailed CM_ID information") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Steve Wise <swise@opengridcomputing.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Guy Levi [Thu, 15 Mar 2018 14:56:40 +0000 (16:56 +0200)]
IB/mlx4: Add Scatter FCS support over WQ creation
As a default, for Ethernet packets, the device scatters only the payload
of ingress packets. The scatter FCS feature lets the user to get the FCS
(Ethernet's frame check sequence) in the received WR's buffer as a 4
Bytes trailer following the packet's payload.
Reviewed-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Guy Levi <guyle@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yishai Hadas [Thu, 15 Mar 2018 14:56:39 +0000 (16:56 +0200)]
IB/mlx4: Report TSO capabilities
Report to the user area the TSO device capabilities, it includes the
max_tso size and the QP types that support it.
The TSO is applicable only when when of the ports is ETH and the device
supports it.
uresp logic around rss_caps is updated to fix a till-now harmless bug
computing the length of the structure to copy. The code did not handle the
implicit padding before rss_caps correctly. This is necessay to copy
tss_caps successfully.
Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Yishai Hadas <yishaih@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Henry Orosco [Wed, 14 Mar 2018 19:45:23 +0000 (14:45 -0500)]
i40iw: Tear-down connection after CQP Modify QP failure
There is no explicit tear-down sequence initiated on
connections if the Control QP OP, Modify QP to close,
fails. Fix this by triggering a driver generated
Asynchronous Event (AE) on Modify QP failures and
tear-down the connection on receipt of the AE.
This fix can be generalized to other Modify QP failures
(i.e. RTS->TERM, IDLE->RTS, etc) as any modify failure
will require a connection tear-down.
Fixes: d37498417947 ("i40iw: add files for iwarp interface") Signed-off-by: Henry Orosco <henry.orosco@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Henry Orosco [Wed, 14 Mar 2018 19:45:22 +0000 (14:45 -0500)]
i40iw: Refactor of driver generated AEs
The flush CQP OP can be used to optionally generate
Asynchronous Events (AEs) in addition to QP flush.
Consolidate all HW AE generation code under a new
function i40iw_gen_ae which use the flush CQP OP
to only generate AEs.
Signed-off-by: Henry Orosco <henry.orosco@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Jason Gunthorpe [Tue, 13 Mar 2018 22:26:46 +0000 (16:26 -0600)]
RDMA/mlx4: Move flag constants to uapi header
MLX4_USER_DEV_CAP_LARGE_CQE (via mlx4_ib_alloc_ucontext_resp.dev_caps)
and MLX4_IB_QUERY_DEV_RESP_MASK_CORE_CLOCK_OFFSET (via
mlx4_uverbs_ex_query_device_resp.comp_mask) are copied directly to
userspace and form part of the uAPI.
Jason Gunthorpe [Tue, 13 Mar 2018 22:33:18 +0000 (16:33 -0600)]
RDMA/rxe: Use structs to describe the uABI instead of opencoding
Open coding pointer math is not acceptable for describing the uABI in
RDMA. Provide structs for all the cases.
The udata is casted to the struct as close to the verbs entry point
as possible for maximum clarity. Function signatures and so forth
are revised to allow for this.
Tested-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Tejun Heo [Wed, 14 Mar 2018 19:45:10 +0000 (12:45 -0700)]
RDMAVT: Fix synchronization around percpu_ref
rvt_mregion uses percpu_ref for reference counting and RCU to protect
accesses from lkey_table. When a rvt_mregion needs to be freed, it
first gets unregistered from lkey_table and then rvt_check_refs() is
called to wait for in-flight usages before the rvt_mregion is freed.
rvt_check_refs() seems to have a couple issues.
* It has a fast exit path which tests percpu_ref_is_zero(). However,
a percpu_ref reading zero doesn't mean that the object can be
released. In fact, the ->release() callback might not even have
started executing yet. Proceeding with freeing can lead to
use-after-free.
* lkey_table is RCU protected but there is no RCU grace period in the
free path. percpu_ref uses RCU internally but it's sched-RCU whose
grace periods are different from regular RCU. Also, it generally
isn't a good idea to depend on internal behaviors like this.
To address the above issues, this patch removes the fast exit and adds
an explicit synchronize_rcu().
Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Cc: Mike Marciniszyn <mike.marciniszyn@intel.com> Cc: linux-rdma@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Yixian Liu [Thu, 15 Mar 2018 07:23:14 +0000 (15:23 +0800)]
RDMA/hns: Fix cqn type and init resp
This patch changes the type of cqn from u32 to u64 to keep
userspace and kernel consistent, initializes resp both for
cq and qp to zeros, and also changes the condition judgment
of outlen considering future caps extension.
Suggested-by: Jason Gunthorpe <jgg@mellanox.com> Fixes: e088a685eae9 (hns: Support rq record doorbell for the user space) Fixes: 9b44703d0a21 (hns: Support cq record doorbell for the user space) Signed-off-by: Yixian Liu <liuyixian@huawei.com> Signed-off-by: Lijun Ou <oulijun@huawei.com> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@huawei.com> Signed-off-by: Shaobo Xu <xushaobo2@huawei.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 13 Mar 2018 14:06:20 +0000 (16:06 +0200)]
IB/core: Refactor ib_init_ah_attr_from_path() for RoCE
Resolving route for RoCE for a path record is needed only for the
received CM requests.
Therefore,
(a) ib_init_ah_attr_from_path() is refactored first to isolate the
code of resolving route.
(b) Setting dlid, path bits is not needed for RoCE.
Additionally ah attribute initialization is done from the path record
entry, so it is better to refer to path record entry type for
different link layer instead of ah attribute type while initializing
ah attribute itself.
Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 13 Mar 2018 14:06:17 +0000 (16:06 +0200)]
IB/ocrdma: Removed GID add/del null routines
add_gid() and del_gid() are optional callback routines.
ib_core ignores invoking them while updating GID table entries if
they are not implemented by provider drivers. Therefore remove them.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 13 Mar 2018 14:06:15 +0000 (16:06 +0200)]
IB/cma: Use rdma_protocol_roce() and remove cma_protocol_roce_dev_port()
rdma_protocol_roce() API from the ib_core already provides a way to
detect whether a given device+port is RoCE or not.
Therefore, make use of it and avoid implementing it again in rdmacm
module.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 13 Mar 2018 14:06:13 +0000 (16:06 +0200)]
IB/core: Honor return status of ib_init_ah_from_mcmember()
The return status of ib_init_ah_from_mcmember() is ignored by
cma_ib_mc_handler(). Honor it and return error event if ah attribute
initialization failed.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 13 Mar 2018 14:06:12 +0000 (16:06 +0200)]
IB/{core, ipoib}: Simplify ib_find_gid() for unused ndev
ib_find_gid() is only used by IPoIB driver. For IB link layer, GID table
entries are not based on netdevice. Netdevice parameter is unused here.
Therefore, it is removed.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Reviewed-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Parav Pandit [Tue, 13 Mar 2018 14:06:11 +0000 (16:06 +0200)]
IB/core: Fix comments of GID query functions
Exported symbol's comments should be with function definition and not in
the header file. Therefore comments of ib_find_cached_gid() and
ib_find_cached_gid_by_port() functions are moved closer to their
definitions.
The function name in then comment is different than the actual function
name, fix it to be same as ib_cache_gid_find_by_filter().
Also current comment section of ib_find_cached_gid_by_port() contains the
desciption of ib_find_cached_gid(), fix that as well.
Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Doug Ledford [Wed, 14 Mar 2018 22:49:12 +0000 (18:49 -0400)]
Merge branch 'k.o/wip/dl-for-rc' into k.o/wip/dl-for-next
Due to bug fixes found by the syzkaller bot and taken into the for-rc
branch after development for the 4.17 merge window had already started
being taken into the for-next branch, there were fairly non-trivial
merge issues that would need to be resolved between the for-rc branch
and the for-next branch. This merge resolves those conflicts and
provides a unified base upon which ongoing development for 4.17 can
be based.
Conflicts:
drivers/infiniband/hw/mlx5/main.c - Commit 42cea83f9524
(IB/mlx5: Fix cleanup order on unload) added to for-rc and
commit b5ca15ad7e61 (IB/mlx5: Add proper representors support)
add as part of the devel cycle both needed to modify the
init/de-init functions used by mlx5. To support the new
representors, the new functions added by the cleanup patch
needed to be made non-static, and the init/de-init list
added by the representors patch needed to be modified to
match the init/de-init list changes made by the cleanup
patch.
Updates:
drivers/infiniband/hw/mlx5/mlx5_ib.h - Update function
prototypes added by representors patch to reflect new function
names as changed by cleanup patch
drivers/infiniband/hw/mlx5/ib_rep.c - Update init/de-init
stage list to match new order from cleanup patch
Arnd Bergmann [Tue, 20 Feb 2018 20:56:27 +0000 (21:56 +0100)]
infiniband: bnxt_re: use BIT_ULL() for 64-bit bit masks
On 32-bit targets, we otherwise get a warning about an impossible constant
integer expression:
In file included from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/infiniband/hw/bnxt_re/ib_verbs.c:39:
drivers/infiniband/hw/bnxt_re/ib_verbs.c: In function 'bnxt_re_query_device':
include/linux/bitops.h:7:24: error: left shift count >= width of type [-Werror=shift-count-overflow]
#define BIT(nr) (1UL << (nr))
^~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:61:34: note: in expansion of macro 'BIT'
#define BNXT_RE_MAX_MR_SIZE_HIGH BIT(39)
^~~
drivers/infiniband/hw/bnxt_re/bnxt_re.h:62:30: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE_HIGH'
#define BNXT_RE_MAX_MR_SIZE BNXT_RE_MAX_MR_SIZE_HIGH
^~~~~~~~~~~~~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/ib_verbs.c:149:25: note: in expansion of macro 'BNXT_RE_MAX_MR_SIZE'
ib_attr->max_mr_size = BNXT_RE_MAX_MR_SIZE;
^~~~~~~~~~~~~~~~~~~
Fixes: 872f3578241d ("RDMA/bnxt_re: Add support for MRs with Huge pages") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Arnd Bergmann [Tue, 20 Feb 2018 20:56:26 +0000 (21:56 +0100)]
infiniband: qplib_fp: fix pointer cast
Building for a 32-bit target results in a couple of warnings from casting
between a 32-bit pointer and a 64-bit integer:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_service_nq':
drivers/infiniband/hw/bnxt_re/qplib_fp.c:333:23: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
bnxt_qplib_arm_srq((struct bnxt_qplib_srq *)q_handle,
^
drivers/infiniband/hw/bnxt_re/qplib_fp.c:336:12: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
(struct bnxt_qplib_srq *)q_handle,
^
In file included from include/linux/byteorder/little_endian.h:5,
from arch/arm/include/uapi/asm/byteorder.h:22,
from include/asm-generic/bitops/le.h:6,
from arch/arm/include/asm/bitops.h:342,
from include/linux/bitops.h:38,
from include/linux/kernel.h:11,
from include/linux/interrupt.h:6,
from drivers/infiniband/hw/bnxt_re/qplib_fp.c:39:
drivers/infiniband/hw/bnxt_re/qplib_fp.c: In function 'bnxt_qplib_create_srq':
include/uapi/linux/byteorder/little_endian.h:31:43: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
#define __cpu_to_le64(x) ((__force __le64)(__u64)(x))
^
include/linux/byteorder/generic.h:86:21: note: in expansion of macro '__cpu_to_le64'
#define cpu_to_le64 __cpu_to_le64
^~~~~~~~~~~~~
drivers/infiniband/hw/bnxt_re/qplib_fp.c:569:19: note: in expansion of macro 'cpu_to_le64'
req.srq_handle = cpu_to_le64(srq);
Using a uintptr_t as an intermediate works on all architectures.
Fixes: 37cb11acf1f7 ("RDMA/bnxt_re: Add SRQ support for Broadcom adapters") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Andrew Morton [Tue, 13 Mar 2018 22:06:45 +0000 (15:06 -0700)]
drivers/infiniband/ulp/srpt/ib_srpt.c: fix build with gcc-4.4.4
gcc-4.4.4 has issues with initialization of anonymous unions:
drivers/infiniband/ulp/srpt/ib_srpt.c: In function 'srpt_zerolength_write':
drivers/infiniband/ulp/srpt/ib_srpt.c:854: error: unknown field 'wr_cqe' specified in initializer
drivers/infiniband/ulp/srpt/ib_srpt.c:854: warning: initialization makes integer from pointer without a cast
Work aound this.
Fixes: 2a78cb4db487 ("IB/srpt: Fix an out-of-bounds stack access in srpt_zerolength_write()") Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Andrew Morton [Tue, 13 Mar 2018 21:51:57 +0000 (14:51 -0700)]
drivers/infiniband/core/verbs.c: fix build with gcc-4.4.4
gcc-4.4.4 has issues with initialization of anonymous unions.
drivers/infiniband/core/verbs.c: In function '__ib_drain_sq':
drivers/infiniband/core/verbs.c:2204: error: unknown field 'wr_cqe' specified in initializer
drivers/infiniband/core/verbs.c:2204: warning: initialization makes integer from pointer without a cast
Work around this.
Fixes: a1ae7d0345edd5 ("RDMA/core: Avoid that ib_drain_qp() triggers an out-of-bounds stack access") Cc: Bart Van Assche <bart.vanassche@wdc.com> Cc: Steve Wise <swise@opengridcomputing.com> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jason Gunthorpe <jgg@mellanox.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Mark Bloch [Wed, 14 Mar 2018 07:14:15 +0000 (09:14 +0200)]
IB/mlx5: Fix cleanup order on unload
On load we create private CQ/QP/PD in order to be used by UMR, we create
those resources after we register ourself as an IB device, and we destroy
them after we unregister as an IB device. This was changed by commit 16c1975f1032 ("IB/mlx5: Create profile infrastructure to add and remove
stages") which moved the destruction before we unregistration. This
allowed to trigger an invalid memory access when unloading mlx5_ib while
there are open resources:
We restore the original behavior by breaking the UMR stage into two parts,
pre and post IB registration stages, this way we can restore the original
functionality and maintain clean separation of logic between stages.
Fixes: 16c1975f1032 ("IB/mlx5: Create profile infrastructure to add and remove stages") Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Martin Wilck [Wed, 14 Feb 2018 20:45:43 +0000 (21:45 +0100)]
rdma_rxe: make rxe work over 802.1q VLAN devices
This patch fixes RDMA/rxe over 802.1q VLAN devices.
Without it, I observed the following behavior:
a) adding a VLAN device to RXE via rxe_net_add() creates a non-functional
RDMA device. This is caused by the logic in enum_all_gids_of_dev_cb() /
is_eth_port_of_netdev(), which only considers networks connected to
"upper devices" of the configured network device, resulting in an empty
set of gids for a VLAN interface that is an "upper device" itself.
Later attempts to connect via this rdma device fail in cma_acuire_dev()
because no gids can be resolved.
b) adding the master device of the VLAN device instead seems to work
initially, target addresses via VLAN devices are resolved successfully.
But the connection times out because no 802.1q VLAN headers are
inserted in the ethernet packets, which are therefore never received.
This happens because the RXE layer sends the packets via the master
device rather than the VLAN device.
The problem could be solved by changing either a) or b). My thinking was
that the logic in a) was created deliberately, thus I decided to work on
b). It turns out that the information about the VLAN interface for the gid
at hand is available in the AV information. My patch converts the RXE code
to use this netdev instead of rxe->ndev. With this change, RXE over vlan
works on my test system.
Signed-off-by: Martin Wilck <mwilck@suse.com> Reviewed-by: Moni Shoua <monis@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Tue, 13 Mar 2018 16:37:27 +0000 (18:37 +0200)]
RDMA/ucma: Don't allow join attempts for unsupported AF family
Users can provide garbage while calling to ucma_join_ip_multicast(),
it will indirectly cause to rdma_addr_size() return 0, making the
call to ucma_process_join(), which had the right checks, but it is
better to check the input as early as possible.
Fixes: 5bc2b7b397b0 ("RDMA/ucma: Allow user space to specify AF_IB when joining multicast") Reported-by: <syzbot+2287ac532caa81900a4e@syzkaller.appspotmail.com> Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Arnd Bergmann [Tue, 13 Mar 2018 12:06:20 +0000 (13:06 +0100)]
RDMA/i40iw: include linux/irq.h
We get a build failure on ARM unless the header is included explicitly:
drivers/infiniband/hw/i40iw/i40iw_verbs.c: In function 'i40iw_get_vector_affinity':
drivers/infiniband/hw/i40iw/i40iw_verbs.c:2747:9: error: implicit declaration of function 'irq_get_affinity_mask'; did you mean 'irq_create_affinity_masks'? [-Werror=implicit-function-declaration]
return irq_get_affinity_mask(msix_vec->irq);
^~~~~~~~~~~~~~~~~~~~~
irq_create_affinity_masks
drivers/infiniband/hw/i40iw/i40iw_verbs.c:2747:9: error: returning 'int' from a function with return type 'const struct cpumask *' makes pointer from integer without a cast [-Werror=int-conversion]
return irq_get_affinity_mask(msix_vec->irq);
Ilya Lesokhin [Tue, 13 Mar 2018 13:18:48 +0000 (15:18 +0200)]
IB/mlx5: Maintain a single emergency page
The mlx5 driver needs to be able to issue invalidation to ODP MRs
even if it cannot allocate memory. To this end it preallocates
emergency pages to use when the situation arises.
This flow should be extremely rare enough, that we don't need
to worry about contention and therefore a single emergency page
is good enough.
Daniel Jurgens [Tue, 13 Mar 2018 13:18:47 +0000 (15:18 +0200)]
IB/mlx5: Only synchronize RCU once when removing mkeys
Instead synchronizing RCU in a loop when removing mkeys in a batch do it
once at the end before freeing them. The result is only waiting for one
RCU grace period instead of many serially.
Signed-off-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Mark Bloch [Tue, 13 Mar 2018 13:18:46 +0000 (15:18 +0200)]
IB/mlx5: Expose more priorities for bypass namespace
BYPASS namespace is used by the RDMA side to insert flow rules into
the vport RX flow tables. Currently only 8 priorities are exposed,
increase this to 16 to allow more flexibility. This change will also
cause the BYPASS namespace to use 32 levels (as apposed to 16 today) of
flow tables, 16 levels for regular rules and 16 for don't trap rules.
Reviewed-by: Maor Gottlieb <maorg@mellanox.com> Signed-off-by: Mark Bloch <markb@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Tatyana Nikolova [Mon, 12 Mar 2018 22:14:02 +0000 (17:14 -0500)]
RDMA/core: Do not use invalid destination in determining port reuse
cma_port_is_unique() allows local port reuse if the quad (source
address and port, destination address and port) for this connection
is unique. However, if the destination info is zero or unspecified, it
can't make a correct decision but still allows port reuse. For example,
sometimes rdma_bind_addr() is called with unspecified destination and
reusing the port can lead to creating a connection with a duplicate quad,
after the destination is resolved. The issue manifests when MPI scale-up
tests hang after the duplicate quad is used.
Set the destination address family and add checks for zero destination
address and port to prevent source port reuse based on invalid destination.
Fixes: 19b752a19dce ("IB/cma: Allow port reuse for rdma_id") Reviewed-by: Sean Hefty <sean.hefty@intel.com> Signed-off-by: Tatyana Nikolova <tatyana.e.nikolova@intel.com> Signed-off-by: Shiraz Saleem <shiraz.saleem@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Bart Van Assche [Mon, 12 Mar 2018 20:55:55 +0000 (13:55 -0700)]
IB/srp: Fix IPv6 address parsing
Split IPv6 addresses at the colon that separates the IPv6 address
and the port number instead of at a colon in the middle of the IPv6
address. Check whether the IPv6 address is surrounded with square
brackets.
Leon Romanovsky [Mon, 12 Mar 2018 19:26:37 +0000 (21:26 +0200)]
RDMA/mlx5: Fix crash while accessing garbage pointer and freed memory
The failure in rereg_mr flow caused to set garbage value (error value)
into mr->umem pointer. This pointer is accessed at the release stage
and it causes to the following crash.
There is not enough to simply change umem to point to NULL, because the
MR struct is needed to be accessed during MR deregistration phase, so
delay kfree too.
Leon Romanovsky [Sun, 11 Mar 2018 11:51:35 +0000 (13:51 +0200)]
RDMA/verbs: Simplify modify QP check
All callers to ib_modify_qp_is_ok() provides enum ib_qp_state
makes the checks of out-of-scope redundant. Let's remove them
together with updating function signature to return boolean result.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Sun, 11 Mar 2018 11:51:33 +0000 (13:51 +0200)]
RDMA/uverbs: Ensure validity of current QP state value
The QP state is internal enum which is checked at the driver
level by calling to ib_modify_qp_is_ok(). Move this check closer
to user and leave kernel users to be checked by compiler.
Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Sun, 11 Mar 2018 11:51:32 +0000 (13:51 +0200)]
RDMA/mlx5: Fix NULL dereference while accessing XRC_TGT QPs
mlx5 modify_qp() relies on FW that the error will be thrown if wrong
state is supplied. The missing check in FW causes the following crash
while using XRC_TGT QPs.
Memory state around the buggy address: ffff880066b99180: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff880066b99200: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>ffff880066b99280: 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc fc
^ ffff880066b99300: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff880066b99380: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Cc: syzkaller <syzkaller@googlegroups.com> Fixes: 0fb2ed66a14c ("IB/mlx5: Add create and destroy functionality for Raw Packet QP") Signed-off-by: Boris Pismenny <borisp@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Doug Ledford [Tue, 13 Mar 2018 19:49:34 +0000 (15:49 -0400)]
Merge tag 'mlx5-updates-2018-02-28-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-2 (IPSec-2)
This series follows our previous one to lay out the foundations for IPSec
in user-space and extend current kernel netdev IPSec support. As noted in
our previous pull request cover letter "mlx5-updates-2018-02-28-1 (IPSec-1)",
the IPSec mechanism will be supported through our flow steering mechanism.
Therefore, we need to change the initialization order. Furthermore, IPsec
is also supported in both egress and ingress. Since our current flow
steering is egress only, we add an empty (only implemented through FPGA
steering ops) egress namespace to handle that case. We also implement
the required flow steering callbacks and logic in our FPGA driver.
We extend the FPGA support for ESN and modifying a xfrm too. Therefore, we
add support for some new FPGA command interface that supports them. The
other required bits are added too. The new features and requirements are
advertised via cap bits.
Last but not least, we revise our driver's accel_esp API. This API will be
shared between our netdev and IB driver, so we need to have all the required
functionality from both worlds.
Doug Ledford [Fri, 9 Mar 2018 23:07:46 +0000 (18:07 -0500)]
Revert "RDMA/mlx5: Fix integer overflow while resizing CQ"
The original commit of this patch has a munged log message that is
missing several of the tags the original author intended to be on the
patch. This was due to patchworks misinterpreting a cut-n-paste
separator line as an end of message line and munging the mbox that was
used to import the patch:
Steve Wise [Thu, 1 Mar 2018 21:58:20 +0000 (13:58 -0800)]
mlx4_ib: zero out struct ib_pd when allocating
Zero out the fields of the struct ib_pd for user mode pds so that
users querying pds via nldev will not get garbage. For simplicity,
use kzalloc() to allocate the mlx4_ib_pd struct.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 1 Mar 2018 21:57:44 +0000 (13:57 -0800)]
RDMA/nldev: provide detailed CM_ID information
Implement RDMA nldev netlink interface to get detailed CM_ID information.
Because cm_id's are attached to rdma devices in various work queue
contexts, the pid and task information at restrak_add() time is sometimes
not useful. For example, an nvme/f host connection cm_id ends up being
bound to a device in a work queue context and the resulting pid at attach
time no longer exists after connection setup. So instead we mark all
cm_id's created via the rdma_ucm as "user", and all others as "kernel".
This required tweaking the restrack code a little. It also required
wrapping some rdma_cm functions to allow passing the module name string.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Steve Wise [Thu, 1 Mar 2018 21:57:29 +0000 (13:57 -0800)]
RDMA/nldev: common resource dumpit function
Create a common dumpit function that can be used by all common resource
types. This reduces code replication and simplifies the code as we add
more resource types.
Signed-off-by: Steve Wise <swise@opengridcomputing.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
The kernel compiled with CONFIG_REFCOUNT_FULL produces the following
error. The reason to it that initial value of refcount_t is supposed
to be more than 0, change it.
Aviad Yehezkel [Thu, 18 Jan 2018 14:02:17 +0000 (16:02 +0200)]
net/mlx5: IPSec, Add support for ESN
Currently ESN is not supported with IPSec device offload.
This patch adds ESN support to IPsec device offload.
Implementing new xfrm device operation to synchronize offloading device
ESN with xfrm received SN. New QP command to update SA state at the
following:
^ - marks where QP command invoked to update the SA ESN state
machine.
| - marks the start of the ESN scope (0-2^32-1). At this point move SA
ESN overlap bit to zero and increment ESN.
* - marks the middle of the ESN scope (2^31). At this point move SA
ESN overlap bit to one.
Aviad Yehezkel [Sun, 18 Feb 2018 13:07:20 +0000 (15:07 +0200)]
net/mlx5: Add flow-steering commands for FPGA IPSec implementation
In order to add a context to the FPGA, we need to get both the software
transform context (which includes the keys, etc) and the
source/destination IPs (which are included in the steering
rule). Therefore, we register new set of firmware like commands for
the FPGA. Each time a rule is added, the steering core infrastructure
calls the FPGA command layer. If the rule is intended for the FPGA,
it combines the IPs information with the software transformation
context and creates the respective hardware transform.
Afterwards, it calls the standard steering command layer.
Aviad Yehezkel [Thu, 18 Jan 2018 11:05:48 +0000 (13:05 +0200)]
net/mlx5: Refactor accel IPSec code
The current code has one layer that executed FPGA commands and
the Ethernet part directly used this code. Since downstream patches
introduces support for IPSec in mlx5_ib, we need to provide some
abstractions. This patch refactors the accel code into one layer
that creates a software IPSec transformation and another one which
creates the actual hardware context.
The internal command implementation is now hidden in the FPGA
core layer. The code also adds the ability to share FPGA hardware
contexts. If two contexts are the same, only a reference count
is taken.
Aviad Yehezkel [Tue, 16 Jan 2018 14:12:22 +0000 (16:12 +0200)]
net/mlx5: IPSec, Add command V2 support
This patch adds V2 command support.
New fpga devices support extended features (udp encap, esn etc...), this
features require new hardware sadb format therefore we have a new version
of commands to manipulate it.
Yossi Kuperman [Sun, 22 Oct 2017 16:45:45 +0000 (19:45 +0300)]
net/mlx5e: IPSec, Add support for ESP trailer removal by hardware
Current hardware decrypts and authenticates incoming ESP packets.
Subsequently, the software extracts the nexthdr field, truncates the
trailer and adjusts csum accordingly.
With this patch and a capable device, the trailer is being removed
by the hardware and the nexthdr field is conveyed via PET. This way
we avoid both the need to access the trailer (cache miss) and to
compute its relative checksum, which significantly improve
the performance.
Experiment shows that trailer removal improves the performance by
2Gbps, (netperf). Both forwarding and host-to-host configurations.
Yossi Kuperman [Sun, 22 Oct 2017 16:43:58 +0000 (19:43 +0300)]
net/mlx5: IPSec, Generalize sandbox QP commands
The current code assume only SA QP commands.
Refactor in order to pave the way for new QP commands:
1. Generic cmd response format.
2. SA cmd checks are in dedicated functions.
3. Aligned debug prints.
Doug Ledford [Wed, 7 Mar 2018 20:40:29 +0000 (15:40 -0500)]
Merge tag 'mlx5-updates-2018-02-28-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux into k.o/wip/dl-for-next
mlx5-updates-2018-02-28-1 (IPSec-1)
This series consists of some fixes and refactors for the mlx5 drivers,
especially around the FPGA and flow steering. Most of them are trivial
fixes and are the foundation of allowing IPSec acceleration from user-space.
We use flow steering abstraction in order to accelerate IPSec packets.
When a user creates a steering rule, [s]he states that we'll carry an
encrypt/decrypt flow action (using a specific configuration) for every
packet which conforms to a certain match. Since currently offloading these
packets is done via FPGA, we'll add another set of flow steering ops.
These ops will execute the required FPGA commands and then call the
standard steering ops.
In order to achieve this, we need that the commands will get all the
required information. Therefore, we pass the fte object and embed the
flow_action struct inside the fte. In addition, we add the shim layer
that will later be used for alternating between the standard and the
FPGA steering commands.
Some fixes, like " net/mlx5e: Wait for FPGA command responses with a timeout"
are very relevant for user-space applications, as these applications could
be killed, but we still want to wait for the FPGA and update the kernel's
database.
Zhu Yanjun [Tue, 27 Feb 2018 11:04:32 +0000 (06:04 -0500)]
IB/rxe: remove unnecessary skb_clone
In send_atomic_ack function, it is not necessary to make a
skb_clone. To gain better performance (high throughput and
low latency), this skb_clone is removed.
The following tests are made.
server client
--------- ---------
|1.1.1.1|<----rxe-channel--->|1.1.1.2|
--------- ---------
On server: rping -s -a 1.1.1.1 -v -C 1000 -S 512
On client: rping -c -a 1.1.1.1 -v -C 1000 -S 512
The kernel config CONFIG_DEBUG_KMEMLEAK is enabled on both server
and client.
This test runs for several hours. There is no memory leak and the whole
system can work well.
Based on the above network, the following tests are made.
Bart Van Assche [Fri, 2 Mar 2018 21:14:15 +0000 (13:14 -0800)]
IB/srpt: Add RDMA/CM support
Add a parameter for configuring the port on which the ib_srpt driver
listens for incoming RDMA/CM connections, namely
/sys/kernel/config/target/srpt/discovery_auth/rdma_cm_port. The default
value for this parameter is 0 which means "do not listen for incoming
RDMA/CM connections". Add RDMA/CM support to all code that handles
connection state changes. Modify srpt_init_nodeacl() such that ACLs can
be configured for IPv4 and IPv6 addresses.
Note: incoming connection requests are only accepted for ports that
have been enabled. See also the "if (!sport->enabled)" code in the
connection request handler. See also the following configfs attribute:
/sys/kernel/config/target/srpt/$port/$port/enable.
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Leon Romanovsky [Wed, 7 Mar 2018 12:49:09 +0000 (14:49 +0200)]
RDMA/ucma: Limit possible option size
Users of ucma are supposed to provide size of option level,
in most paths it is supposed to be equal to u8 or u16, but
it is not the case for the IB path record, where it can be
multiple of struct ib_path_rec_data.
This patch takes simplest possible approach and prevents providing
values more than possible to allocate.
Reported-by: syzbot+a38b0e9f694c379ca7ce@syzkaller.appspotmail.com Fixes: 7ce86409adcd ("RDMA/ucma: Allow user space to set service type") Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
Parav Pandit [Wed, 7 Mar 2018 06:07:41 +0000 (08:07 +0200)]
IB/core: Fix possible crash to access NULL netdev
resolved_dev returned might be NULL as ifindex is transient number.
Ignoring NULL check of resolved_dev might crash the kernel.
Therefore perform NULL check before accessing resolved_dev.
Additionally rdma_resolve_ip_route() invokes addr_resolve() which
performs check and address translation for loopback ifindex.
Therefore, checking it again in rdma_resolve_ip_route() is not helpful.
Therefore, the code is simplified to avoid IFF_LOOPBACK check.
Fixes: 200298326b27 ("IB/core: Validate route when we init ah") Reviewed-by: Daniel Jurgens <danielj@mellanox.com> Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Leon Romanovsky <leon@kernel.org> Signed-off-by: Doug Ledford <dledford@redhat.com>
Aviad Yehezkel [Sun, 18 Feb 2018 13:00:54 +0000 (15:00 +0200)]
net/mlx5: Flow steering cmd interface should get the fte when deleting
Previously, deleting a flow steering entry only got the index.
Since the FPGA implementation of FTE's deletion might need to dig
inside the FTE itself, we would like to get the FTE's context.
Changing the interface to pass the FTE context.