Scott Peterson [Thu, 13 Apr 2017 08:45:44 +0000 (04:45 -0400)]
i40e/i40evf: Add tracepoints
This patch adds tracepoints to the i40e and i40evf drivers to which
BPF programs can be attached for feature testing and verification.
It's expected that an attached BPF program will identify and count or
log some interesting subset of traffic. The bcc-tools package is
helpful there for containing all the BPF arcana in a handy Python
wrapper. Though you can make these tracepoints log trace messages, the
messages themselves probably won't be very useful (other to verify the
tracepoint is being called while you're debugging your BPF program).
The idea here is that tracepoints have such low performance cost when
disabled that we can leave these in the upstream drivers. This may
eventually enable the instrumentation of unmodified customer systems
should the need arise to verify a NIC feature is working as expected.
In general this enables one set of feature verification tools to be
used on these drivers whether they're built with the kernel or
separately.
Users are advised against using these tracepoints for anything other
than a diagnostic tool. They have a performance impact when enabled,
and their exact placement and form may change as we see how well they
work in practice for the purposes above.
Change-ID: Id6014a7322c0e6d08068114dd20bd156f2f6435e Signed-off-by: Scott Peterson <scott.d.peterson@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Mitch Williams [Wed, 12 Apr 2017 11:16:52 +0000 (07:16 -0400)]
i40e: dump VF information in debugfs
Dump some internal state about VFs through debugfs. This provides
information not available with 'ip link show'. To use, write "dump vf
<id>" to the command file, or just "dump vf" to dump information on all
of the VFs.
Change-ID: Ibe32b7f4ae55d4358c0b903217475f708ada1ecd Signed-off-by: Mitch Williams <mitch.a.williams@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Alexander Duyck [Mon, 10 Apr 2017 09:18:43 +0000 (05:18 -0400)]
i40e: Fix support for flow director programming status
This patch fixes an issue I introduced when I converted the code over to
using the length field to determine if a descriptor was done or not. It
turns out that we are also processing programming descriptors in the Rx
path and need to have these processed even though the length field will be
0 on these packets. What will happen with a programming descriptor is that
we will receive a descriptor that has the SPH bit set, and the header
length and packet length fields cleared.
To account for this we should be checking for the bit for split header
being set even though we aren't actually using header split. This bit is
set in the length field to indicate if a programming descriptor response is
contained in the descriptor. Since we don't support header split we don't
need to perform the extra checks of using a fixed value for the entire
length field.
In addition I am moving the function for checking if a filter is a
programming status filter into the i40e_txrx.c file since there is no
longer support for FCoE it doesn't make sense to keep this file in i40e.h.
Change-ID: I12c359c3dc70adb9d6b92b27324bb2c7f04c1a06 Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
alice michael [Thu, 6 Apr 2017 09:59:34 +0000 (05:59 -0400)]
i40e/i40evf: Remove VF Rx csum offload for tunneled packets
Rx checksum offload for tunneled packets was never being negotiated or
requested by VF. This capability was assumed by default and enabled in
current hardware for VF. Going forward, this feature needs to be disabled
or advanced ptypes should be negotiated with PF in the future.
Change-ID: I9e54cfa8a90e03ab6956db4412f1e337ccd2c2e0 Signed-off-by: Preethi Banala <preethi.banala@intel.com> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
i40evf: Use net_device_stats from struct net_device
Instead of using a private copy of struct net_device_stats in
struct i40evf_adapter, use stats from struct net_device. Also remove the
now unnecessary .ndo_get_stats function.
Signed-off-by: Tobias Klauser <tklauser@distanz.ch> Tested-by: Andrew Bowers <andrewx.bowers@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
be2net: VxLAN offload should be re-enabled when only 1 UDP port is left
We disable VxLAN offload when more than 1 UDP port is added to the driver,
since Skyhawk doesn't support offload with multiple ports. The existing
driver design expects the user to delete all port configurations and create
a configuration with a single UDP port for VxLAN offload to be re-enabled.
Remove this restriction by tracking the ports added and re-enabling offload
when ports get deleted and only 1 port is left.
Signed-off-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com> Reviewed-by: Sathya Perla <sathya.perla@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 18 Apr 2017 18:11:10 +0000 (14:11 -0400)]
Merge branch 'ftgmac100-batch5-features'
Benjamin Herrenschmidt says:
====================
ftgmac100: Rework batch 5 - Features
This is the third spin of the fifth and last batch of
updates to the ftgmac100 driver.
This contains a few additional "features" such as:
- Support for ethtool n-way reset
- Multicast filtering & promisc support
- Vlan offload
- netpoll
And a couple of misc bits. This also adds the device-tree binding
documentation.
v2. - Addresses review comments and adds a new patch fixing a
theorical ordering issue in my new NAPI poll implementation
- Add a bug fix (Patch 8/9) for a potential ordering issue
in the new NAPI poll code.
v3. - Rebase on net-next (fix conflict with an unrelated #include
change series)
- Update DT bindings better describing accepted phy-mode values
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
ftgmac100: Fix potential ordering issue in NAPI poll
We need to ensure the loads from the descriptor are done after the
MMIO store clearing the interrupts has completed, otherwise we
might still miss work.
A read back from the MMIO register will "push" the posted store and
ioread32 has a barrier on weakly aordered architectures that will
order subsequent accesses.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ftgmac100: Add pause frames configuration and support
Hopefully my understanding of how the hardware works is correct,
as the documentation isn't completely clear. So far I have seen
no obvious issue. Pause seem to also work with NC-SI.
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 12:32:14 +0000 (14:32 +0200)]
net: pxa168_eth: Use kcalloc() in two functions
Multiplications for the size determination of memory allocations
indicated that array data structures should be processed.
Thus use the corresponding function "kcalloc".
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 08:52:02 +0000 (10:52 +0200)]
net: mvpp2: Fix a jump label position in mvpp2_rx()
The script "checkpatch.pl" pointed out that labels should not be indented.
Thus delete two horizontal tabs before the jump label "err_drop_frame"
in the function "mvpp2_rx".
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 08:40:32 +0000 (10:40 +0200)]
net: mvpp2: Improve a size determination in two functions
Replace the specification of two data structures by pointer dereferences
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 08:30:29 +0000 (10:30 +0200)]
net: mvpp2: Improve 27 size determinations
Replace the specification of data structures by references to
a local variable as the parameter for the operator "sizeof"
to make the corresponding size determination a bit safer.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 07:12:34 +0000 (09:12 +0200)]
net: mvpp2: Improve another size determination in mvpp2_prs_default_init()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 07:06:33 +0000 (09:06 +0200)]
net: mvpp2: Improve another size determination in mvpp2_bm_init()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 06:55:42 +0000 (08:55 +0200)]
net: mvpp2: Improve another size determination in mvpp2_port_probe()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 06:48:23 +0000 (08:48 +0200)]
net: mvpp2: Improve another size determination in mvpp2_init()
Replace the specification of a data structure by a pointer dereference
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 06:38:32 +0000 (08:38 +0200)]
net: mvpp2: Improve two size determinations in mvpp2_probe()
Replace the specification of two data structures by pointer dereferences
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Mon, 17 Apr 2017 06:09:07 +0000 (08:09 +0200)]
net: mvpp2: Use kmalloc_array() in mvpp2_txq_init()
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data structure by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Sun, 16 Apr 2017 20:11:22 +0000 (22:11 +0200)]
net: mvneta: Use kmalloc_array() in mvneta_txq_init()
A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "kmalloc_array".
This issue was detected by using the Coccinelle software.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Sun, 16 Apr 2017 19:45:38 +0000 (21:45 +0200)]
net: mvneta: Improve two size determinations in mvneta_init()
Replace the specification of two data structures by pointer dereferences
as the parameter for the operator "sizeof" to make the corresponding size
determination a bit safer according to the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Markus Elfring [Sun, 16 Apr 2017 19:23:19 +0000 (21:23 +0200)]
net: mvneta: Use devm_kmalloc_array() in mvneta_init()
* A multiplication for the size determination of a memory allocation
indicated that an array data structure should be processed.
Thus use the corresponding function "devm_kmalloc_array".
This issue was detected by using the Coccinelle software.
* Replace the specification of a data type by a pointer dereference
to make the corresponding size determination a bit safer according to
the Linux coding style convention.
Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: David S. Miller <davem@davemloft.net>
commit 83e7e4ce9e93c3 ("mac80211: Use rhltable instead of rhashtable")
removed the last user that made use of 'insecure_elasticity' parameter,
i.e. the default of 16 is used everywhere.
Replace it with a constant.
Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 15 Apr 2017 14:00:29 +0000 (22:00 +0800)]
sctp: process duplicated strreset asoc request correctly
This patch is to fix the replay attack issue for strreset asoc requests.
When a duplicated strreset asoc request is received, reply it with bad
seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with the
result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.
But note that if the result saved in asoc is performed, the sender's next
tsn and receiver's next tsn for the response chunk should be set. It's
safe to get them from asoc. Because if it's changed, which means the peer
has received the response already, the new response with wrong tsn won't
be accepted by peer.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 15 Apr 2017 14:00:28 +0000 (22:00 +0800)]
sctp: process duplicated strreset in and addstrm in requests correctly
This patch is to fix the replay attack issue for strreset and addstrm in
requests.
When a duplicated strreset in or addstrm in request is received, reply it
with bad seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with
the result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.
For strreset in or addstrm in request, if the receiver side processes it
successfully, a strreset out or addstrm out request(as a response for that
request) will be sent back to peer. reconf_time will retransmit the out
request even if it's lost.
So when receiving a duplicated strreset in or addstrm in request and it's
result was performed, it shouldn't reply this request, but drop it instead.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 15 Apr 2017 14:00:27 +0000 (22:00 +0800)]
sctp: process duplicated strreset out and addstrm out requests correctly
Now sctp stream reconf will process a request again even if it's seqno is
less than asoc->strreset_inseq.
If one request has been done successfully and some data chunks have been
accepted and then a duplicated strreset out request comes, the streamin's
ssn will be cleared. It will cause that stream will never receive chunks
any more because of unsynchronized ssn. It allows a replay attack.
A similar issue also exists when processing addstrm out requests. It will
cause more extra streams being added.
This patch is to fix it by saving the last 2 results into asoc. When a
duplicated strreset out or addstrm out request is received, reply it with
bad seqno if it's seqno < asoc->strreset_inseq - 2, and reply it with the
result saved in asoc if it's seqno >= asoc->strreset_inseq - 2.
Note that it saves last 2 results instead of only last 1 result, because
two requests can be sent together in one chunk.
And note that when receiving a duplicated request, the receiver side will
still reply it even if the peer has received the response. It's safe, As
the response will be dropped by the peer.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Chonggang Li [Sun, 16 Apr 2017 19:02:18 +0000 (12:02 -0700)]
bonding: deliver link-local packets with skb->dev set to link that packets arrived on
Bonding driver changes the skb->dev to the bonding-master before
passing the packet to stack for further processing. This, however
does not make sense for the link-local packets and it loses "the
link info" once its skb->dev is changed to bonding-master. This
patch changes this behavior for link-local packets by not changing
the skb->dev to the bonding-master and maintaining it as it is,
i.e. the link on which the packet arrived.
Signed-off-by: Chonggang Li <chonggangli@google.com> Signed-off-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Sun, 16 Apr 2017 16:48:24 +0000 (09:48 -0700)]
net: rtnetlink: plumb extended ack to doit function
Add netlink_ext_ack arg to rtnl_doit_func. Pass extack arg to nlmsg_parse
for doit functions that call it directly.
This is the first step to using extended error reporting in rtnetlink.
>From here individual subsystems can be updated to set netlink_ext_ack as
needed.
Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Lebrun [Sun, 16 Apr 2017 10:27:14 +0000 (12:27 +0200)]
ipv6: sr: fix BUG due to headroom too small after SRH push
When a locally generated packet receives an SRH with two or more segments,
the remaining headroom is too small to push an ethernet header. This patch
ensures that the headroom is large enough after SRH push.
Fixes: 19d5a26f5ef8de5dcb78799feaf404d717b1aac3 ("ipv6: sr: expand skb head only if necessary") Signed-off-by: David Lebrun <david.lebrun@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
Ilan Tayari [Sun, 16 Apr 2017 08:00:07 +0000 (11:00 +0300)]
gso: Validate assumption of frag_list segementation
Commit 07b26c9454a2 ("gso: Support partial splitting at the frag_list
pointer") assumes that all SKBs in a frag_list (except maybe the last
one) contain the same amount of GSO payload.
This assumption is not always correct, resulting in the following
warning message in the log:
skb_segment: too many frags
For example, mlx5 driver in Striding RQ mode creates some RX SKBs with
one frag, and some with 2 frags.
After GRO, the frag_list SKBs end up having different amounts of payload.
If this frag_list SKB is then forwarded, the aforementioned assumption
is violated.
Validate the assumption, and fall back to software GSO if it not true.
Fixes: 07b26c9454a2 ("gso: Support partial splitting at the frag_list pointer") Signed-off-by: Ilan Tayari <ilant@mellanox.com> Signed-off-by: Ilya Lesokhin <ilyal@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Xin Long [Sat, 15 Apr 2017 13:56:57 +0000 (21:56 +0800)]
sctp: get list_of_streams of strreset outreq earlier
Now when processing strreset out responses, it gets outreq->list_of_streams
only when result is performed. But if result is not performed, str_p will
be NULL. It will cause panic in sctp_ulpevent_make_stream_reset_event if
nums is not 0.
This patch is to fix it by getting outreq->list_of_streams earlier, and
also to improve some codes for the strreset inreq process.
Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Add uid and cookie bpf helper to cg_skb_func_proto
BPF helper functions get_socket_cookie and get_socket_uid can be
used for network traffic classifications, among others. Expose
them also to programs of type BPF_PROG_TYPE_CGROUP_SKB. As of
commit 8f917bba0042 ("bpf: pass sk to helper functions") the
required skb->sk function is available at both cgroup bpf ingress
and egress hooks. With these two new helper, cg_skb_func_proto is
effectively the same as sk_filter_func_proto.
Change since V1:
Instead of add the helper to cg_skb_func_proto, redirect the
cg_skb_func_proto to sk_filter_func_proto since all helper function
in sk_filter_func_proto are applicable to cg_skb_func_proto now.
Signed-off-by: Chenbo Feng <fengc@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
The statistics functionis called with RTNL held during probe
but with RCU held during access from /proc and elsewhere.
This is safe so update the lockdep annotation.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Dan Carpenter [Fri, 14 Apr 2017 19:10:41 +0000 (22:10 +0300)]
net: phy: test the right variable in phy_write_mmd()
This is a copy and paste buglet. We meant to test for ->write_mmd but
we test for ->read_mmd.
Fixes: 1ee6b9bc6206 ("net: phy: make phy_(read|write)_mmd() generic MMD accessors") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Here's the main batch of Bluetooth & 802.15.4 patches for the 4.12
kernel.
- Many fixes to 6LoWPAN, in particular for BLE
- New CA8210 IEEE 802.15.4 device driver (accounting for most of the
lines of code added in this pull request)
- Added Nokia Bluetooth (UART) HCI driver
- Some serdev & TTY changes that are dependencies for the Nokia
driver (with acks from relevant maintainers and an agreement that
these come through the bluetooth tree)
- Support for new Intel Bluetooth device
- Various other minor cleanups/fixes here and there
Please let me know if there are any issues pulling. Thanks.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 14 Apr 2017 17:30:30 +0000 (10:30 -0700)]
bpf: lru: Add map-in-map LRU example
This patch adds a map-in-map LRU example.
If we know only a subset of cores will use the
LRU, we can allocate a common LRU list per targeting core
and store it into an array-of-hashs.
It allows using the common LRU map with map-update performance
comparable to the BPF_F_NO_COMMON_LRU map but without wasting memory
on the unused cores that we know they will never access the LRU map.
Notes that the max_entries for the map-in-map LRU test is 1260000 which
is the max_entries for each inner LRU map. 8 processes have been
started, so 8 * 1260000 = 10080000 (~10M) which is close to what is
used in the BPF_F_NO_COMMON_LRU test.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 14 Apr 2017 17:30:29 +0000 (10:30 -0700)]
bpf: lru: Lower the PERCPU_NR_SCANS from 16 to 4
After doing map_perf_test with a much bigger
BPF_F_NO_COMMON_LRU map, the perf report shows a
lot of time spent in rotating the inactive list (i.e.
__bpf_lru_list_rotate_inactive):
> map_perf_test 32 8 10000 1000000 | awk '{sum += $3}END{print sum}' 19644783 (19M/s)
> map_perf_test 32 8 1000000010000000 | awk '{sum += $3}END{print sum}' 6283930 (6.28M/s)
By inactive, it usually means the element is not in cache. Hence,
there is a need to tune the PERCPU_NR_SCANS value.
This patch finds a better number of elements to
scan during each list rotation. The PERCPU_NR_SCANS (which
is defined the same as PERCPU_FREE_TARGET) decreases
from 16 elements to 4 elements. This change only
affects the BPF_F_NO_COMMON_LRU map.
The test_lru_dist does not show meaningful difference
between 16 and 4. Our production L4 load balancer which uses
the LRU map for conntrack-ing also shows little change in cache
hit rate. Since both benchmark and production data show no
cache-hit difference, PERCPU_NR_SCANS is lowered from 16 to 4.
We can consider making it configurable if we find a usecase
later that shows another value works better and/or use
a different rotation strategy.
After this change:
> map_perf_test 32 8 1000000010000000 | awk '{sum += $3}END{print sum}' 9240324 (9.2M/s)
i.e. 6.28M/s -> 9.2M/s
The test_lru_dist has not shown meaningful difference:
> test_lru_dist zipf.100k.a1_01.out 4000 1:
nr_misses: 31575 (Before) vs 31566 (After)
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 14 Apr 2017 17:30:28 +0000 (10:30 -0700)]
bpf: Allow bpf sample programs (*_user.c) to change bpf_map_def
The current bpf_map_def is statically defined during compile
time. This patch allows the *_user.c program to change it during
runtime. It is done by adding load_bpf_file_fixup_map() which
takes a callback. The callback will be called before creating
each map so that it has a chance to modify the bpf_map_def.
The current usecase is to change max_entries in map_perf_test.
It is interesting to test with a much bigger map size in
some cases (e.g. the following patch on bpf_lru_map.c).
However, it is hard to find one size to fit all testing
environment. Hence, it is handy to take the max_entries
as a cmdline arg and then configure the bpf_map_def during
runtime.
This patch adds two cmdline args. One is to configure
the map's max_entries. Another is to configure the max_cnt
which controls how many times a syscall is called.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 14 Apr 2017 17:30:27 +0000 (10:30 -0700)]
bpf: lru: Refactor LRU map tests in map_perf_test
One more LRU test will be added later in this patch series.
In this patch, we first move all existing LRU map tests into
a single syscall (connect) first so that the future new
LRU test can be added without hunting another syscall.
One of the map name is also changed from percpu_lru_hash_map
to nocommon_lru_hash_map to avoid the confusion with percpu_hash_map.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 14 Apr 2017 17:30:26 +0000 (10:30 -0700)]
bpf: lru: Cleanup test_lru_map.c
This patch does the following cleanup on test_lru_map.c
1) Fix indentation (Replace spaces by tabs)
2) Remove redundant BPF_F_NO_COMMON_LRU test
3) Simplify some comments
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
Martin KaFai Lau [Fri, 14 Apr 2017 17:30:25 +0000 (10:30 -0700)]
bpf: lru: Add test_lru_sanity6 for BPF_F_NO_COMMON_LRU
test_lru_sanity3 is not applicable to BPF_F_NO_COMMON_LRU.
It just happens to work when PERCPU_FREE_TARGET == 16.
This patch:
1) Disable test_lru_sanity3 for BPF_F_NO_COMMON_LRU
2) Add test_lru_sanity6 to test list rotation for
the BPF_F_NO_COMMON_LRU map.
Signed-off-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>
net: mvneta: fix failed to suspend if WOL is enabled
Recently, suspend/resume and WOL support are added into mvneta driver.
If we enable WOL, then we get some error as below on Marvell BG4CT
platforms during suspend:
Recently we added support for SW fdbs to take over HW ones, but that
results in changing a user-visible fdb flag thus we need to send a
notification, also it's consistent with how HW takes over SW entries.
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
to find an index in the settings table. phy_find_setting() starts at
index 0, and scans upwards looking for an exact speed and duplex match.
When it doesn't find it, it returns MAX_NUM_SETTINGS - 1, which is
10baseT-Half duplex.
phy_find_valid() then scans from the point (and effectively only checks
one entry) before bailing out, returning MAX_NUM_SETTINGS - 1.
phy_sanitize_settings() then sets ->speed to SPEED_10 and ->duplex to
DUPLEX_HALF whether or not 10baseT-Half is supported or not. This goes
against all the comments against these functions, and 10baseT-Half may
not even be supported by the hardware.
Rework these functions, introducing a new method of scanning the table.
There are two modes of lookup that phylib wants: exact, and inexact.
- in exact mode, we return either an exact match or failure
- in inexact mode, we return an exact match if it exists, a match at
the highest speed that is not greater than the requested speed
(ignoring duplex), or failing that, the lowest supported speed, or
failure.
The biggest difference is that we always check whether the entry is
supported before further consideration, so all unsupported entries are
not considered as candidates.
This results in arguably saner behaviour, better matches the comments,
and is probably what users would expect.
This becomes important as ethernet speeds increase, PHYs exist which do
not support the 10Mbit speeds, and half-duplex is likely to become
obsolete - it's already not even an option on 10Gbit and faster links.
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
Since 3.12 it has been possible to configure the default queuing
discipline via sysctl. This patch adds ability to configure the
default queue discipline in kernel configuration. This is useful for
environments where configuring the value from userspace is difficult
to manage.
The default is still the same as before (pfifo_fast) and it is
possible to change after kernel init with sysctl. This is similar
to how TCP congestion control works.
Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds support for aRFS for TCP and UDP
protocols with IPv4/IPv6.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com> Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds necessary APIs to interface with
qede aRFS support in successive patch.
It also reserves separate PTT entry for aRFS,
[as being in fastpath flow] for hardware access instead of
trying to acquire it at run time from the ptt pool.
Signed-off-by: Manish Chopra <manish.chopra@cavium.com> Signed-off-by: Yuval Mintz <yuval.mintz@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
smsc95xx: Add comments to the registers definition
This chip is used by a lot of embedded devices and also by the Raspberry
Pi 1, 2 & 3 which were created to promote the study of computer
sciences. Students wanting to learn kernel / network device driver
programming through those devices can only rely on the Linux kernel
driver source to make their own.
This commit adds a lot of comments to the registers definition to expand
the register names.
Cc: Steve Glendinning <steve.glendinning@shawell.net> Cc: Microchip Linux Driver Support <UNGLinuxDriver@microchip.com> CC: David Miller <davem@davemloft.net> Signed-off-by: Martin Wetterwald <martin@wetterwald.eu> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Acked-by: Steve Glendinning <steve.glendinning@shawell.net> Acked-by: Woojung Huh <Woojung.Huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
R. Parameswaran [Thu, 13 Apr 2017 01:31:04 +0000 (18:31 -0700)]
l2tp: device MTU setup, tunnel socket needs a lock
The MTU overhead calculation in L2TP device set-up
merged via commit b784e7ebfce8cfb16c6f95e14e8532d0768ab7ff
needs to be adjusted to lock the tunnel socket while
referencing the sub-data structures to derive the
socket's IP overhead.
Reported-by: Guillaume Nault <g.nault@alphalink.fr> Tested-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: R. Parameswaran <rparames@brocade.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Wed, 12 Apr 2017 18:49:04 +0000 (11:49 -0700)]
net: ipv6: send unsolicited NA on admin up
ndisc_notify is the ipv6 equivalent to arp_notify. When arp_notify is
set to 1, gratuitous arp requests are sent when the device is brought up.
The same is expected when ndisc_notify is set to 1 (per ndisc_notify in
Documentation/networking/ip-sysctl.txt). The NA is not sent on NETDEV_UP
event; add it.
Fixes: 5cb04436eef6 ("ipv6: add knob to send unsolicited ND on link-layer address change") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Mon, 17 Apr 2017 15:08:33 +0000 (11:08 -0400)]
Merge branch 'mlx5-RDMA-netdevice'
Saeed Mahameed says:
====================
Mellanox, mlx5 RDMA net device support
This series provides the lower level mlx5 support of RDMA netdevice
creation API [1] suggested and introduced by Intel's HFI OPA VNIC
netdevice driver [2], to enable IPoIB mlx5 RDMA netdevice creation.
mlx5 IPoIB RDMA netdev will serve as an acceleration netdevice for the current
IPoIB ULP generic netdevice, providing:
- mlx5 RSS support.
- mlx5 HW RX,TX offloads (checksum, TSO, LRO, etc ..).
- Full mlx5 HW features transparent to the ULP itself.
The idea here is to reuse and benefit from the already implemented mlx5e netdevice
management and channels API for both etherent and RDMA netdevices, since both IPoIB
and Ethernet netdevices share same common mlx5 HW resources (with some small
exceptions) and share most of the control/data path logic, it is more natural to
have them share the same code.
The differences between IPoIB and Ethernet netdevices can be summarized to:
Steering:
In mlx5, IPoIB traffic is sent and received from an underlay special QP, and in Ethernet
the traffic is handled by vports and vport steering is managed by e-switch or FW.
For IPoIB traffic to get steered correctly the only thing we need to do is to create RSS
HW contexts for RX and TX HW contexts for TX (similar to mlx5e) with the underlay QP attached to
them (underlay QP will be 0 in case of Ethernet).
RX,TX:
Since IPoIB traffic is different, slightly modified RX and TX handlers are required,
still we do some code reuse in data path via common helper functions.
All of the other generic netdevice and mlx5 aspects will be shared between mlx5 Ethernet
and IPoIB netdevices, e.g.
- Channels creation and handling (RQs,SQs,CQs, NAPI, interrupt moderation, etc..)
- Offloads, checksum, GRO, LRO, TSO, and more.
- netdevice logic and non Ethernet specific ndos (open/close, etc..)
In order to achieve what we want:
In patchet 1 to 3, Erez added the supported for underlay QP in mlx5_ifc and refactored
the mlx5 steering code to accept the underlay QP as a parameter for creating steering
objects and enabled flow steering for IB link.
Then we are going to use the mlx5e netdevice profile, which is already used to separate between
NIC and VF representors netdevices, to create new type of IPoIB netdevice profile.
For that, one small refactoring is required to make mlx5e netdevice profile management
more genetic and agnostic to link type which is done in patch #4.
In patch #5, we introduce ipoib.c to host all of mlx5 IPoIB (mlx5i) specific logic and a
skeleton for the IPoIB mlx5 netdevice profile, and we will start filling it in next patches,
using mlx5e already existing APIs.
Patch #6 and #7, Implement init/cleanup RX mlx5i netdev profile handlers to create mlx5 RSS
resources, same as mlx5e but without vlan and L2 steering tables.
Patch #8, Implement init/cleanup TX mlx5i netdev profile handlers, to create TX resources
same as mlx5e but with one TC (tc = 0) support.
Patch #9, Implement mlx5i open/close ndos, where we reuese the mlx5e channels API, to start/stop TX/RX channels.
Patch #10, Create the underlay QP and attach it to mlx5i RSS and TX HW contexts.
Patch #11 and #12, Break down the mlx5e xmit flow into smaller helper function and implement the
mlx5i IPoIB xmit routine.
Patch #13 and #14, Have an RX handler per netdevice profile. We already do this before this series
in a non clean way to separate between NIC netdev and VF representor RX handlers, in patch 13 we make
the RX handler generic and bound to a profile and in patch 14 we implement the IPoIB RX handlers.
Patch #15, Small cleanup to avoid e-switch with IPoIB netdev.
In order to enable mlx5 IPoIB, a merge between the IPoIB RDMA netdev offolad support [3]
- which was alread submitted to the rdma mailing list - and this series is required
plus an extra small patch [4] which will connect between both sides and actually enables the offload.
Once both patch-sets are merged into linux we will have to submit the extra small patch [4], to enable
the feature.
Add check for bit IB_QP_CREATE_NETIF_QP while creating QP.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5e: E-switch vport manager is valid for ethernet only
Currently the driver support only ethernet eswitch, and we want to
protect downstream IPoIB netdev from trying to access it in IB link.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In order to have different RX handler per profile, fix and refactor the
current code to take the rx handler directly from the netdevice profile
rather than computing it on runtime as it was done with the switchdev
mode representor rx handler.
This will also remove the current wrong assumption in mlx5e_alloc_rq
code that mlx5e_priv->ppriv is of the type vport_rep.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Implement mlx5e's IPoIB SKB transmit using the helper functions provided
by mlx5e ethernet tx flow, the only difference in the code between
mlx5e_xmit and mlx5i_xmit is that IPoIB has some extra fields to fill
(UD datagram segment) in the TX descriptor (WQE) and it doesn't need to
have any vlan handling.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Break current mlx5e xmit flow into smaller blocks (helper functions)
in order to reuse them for IPoIB SKB transmission.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Create IPoIB underlay QP needed by the IPoIB netdevice profile for RSS
and TX HW context to perform on IPoIB traffic.
Reset the underlay QP on dev_uninit ndo to stop IPoIB traffic going
through this QP when the ULP IPoIB decides to cleanup.
Implement attach/detach mcast RDMA netdev callbacks for later RDMA
netdev use.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Implement open/close of IPoIB netdevice ndos using mlx5e's
channels API to manage data path resources (RQs/SQs/CQs).
Set IPoIB netdev address on dev_init ndo.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Modify mlx5e tis creation function to accept underlay qp number, which
will be needed by IPoIB.
Implement mlx5i (IPoIB) tx init/cleanup netdevice profile flows to
create one TIS with the IPoIB underlay qp, for IPoIB TX SQs.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Like the mlx5e ethernet mode, on IPoIB mode we need to create RX steering
tables, but IPoIB do not require MAC and VLAN steering tables so the
only tables we create in here are:
1. TTC Table (Traffic Type Classifier table for RSS steering)
2. ARFS Table (for accelerated RFS support)
Creation of those tables is identical to mlx5e ethernet mode, hence the
use of mlx5e_create_ttc_table and mlx5e_arfs_create_tables.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Implement IPoIB RX RSS (RQTs and TIRs) HW objects creation,
All we do here is simply reuse the mlx5e implementation to create
direct and indirect (RSS) steering HW objects.
For that we just expose
mlx5e_{create,destroy}_{direct,indirect}_{rqt,tir} functions into en.h
and call them from ipoib.c in init/cleanup_rx IPoIB netdevice profile
callbacks.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Create mlx5e IPoIB netdevice profile skeleton in the new ipoib.c
file with empty implementation.
Downstream patches will provide the full mlx5 rdma netdevice acceleration
support for IPoIB into this new file, by using the mlx5e netdevice
profile and new mlx5_channels APIs and infrastructures.
Same as already done in mlx5e NIC netdevice and switchdev mode VF
representors.
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
In preparation for mlx5e RDMA net_device support, here we generalize
mlx5e_attach/detach in a way that those functions will be agnostic
to link type. For that we move ethernet specific NIC net device logic out
of those functions into {nic,rep}_{enable/disable} mlx5e NIC and
representor profiles callbacks.
Also some of the logic was moved only to NIC profile since it is not right
to have this logic for representor net device (e.g. set port MTU).
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Get the relevant capabilities if supports ipoib_enhanced_offloads and
init the flow steering table accordingly.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: Refactor create flow table method to accept underlay QP
IB flow tables need the underlay qp to perform flow steering.
Here we change the API of the flow tables creation to accept the
underlay QP number as a parameter in order to support IB (IPoIB) flow
steering.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net/mlx5: Add IPoIB enhanced offloads bits to mlx5_ifc
New capability bit: ipoib_enhanced_offloads, indicates new ability for UD
QP to do RSS and enhanced IPoIB offloads and acceleration.
Add underlay_qpn to the TIS and flow_table objects In order to support
SET_ROOT command, to connect between IPoIB QPs and flow steering tables.
Signed-off-by: Erez Shitrit <erezsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: Exclude non-TCP port numbers from vRSS hashing
Azure hosts are not supporting non-TCP port numbers in vRSS hashing for
now. For example, UDP packet loss rate will be high if port numbers are
also included in vRSS hash.
So, we created this patch to use only IP numbers for hashing in non-TCP
traffic.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
hv_netvsc: Fix the queue index computation in forwarding case
If the outgoing skb has a RX queue mapping available, we use the queue
number directly, other than put it through Send Indirection Table.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>