Haiyang Zhang [Thu, 26 Mar 2015 16:03:37 +0000 (09:03 -0700)]
hv_netvsc: Implement batching in send buffer
With this patch, we can send out multiple RNDIS data packets in one send buffer
slot and one VMBus message. It reduces the overhead associated with VMBus messages.
Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com> Reviewed-by: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for your net-next tree.
Basically, nf_tables updates to add the set extension infrastructure and finish
the transaction for sets from Patrick McHardy. More specifically, they are:
1) Move netns to basechain and use recently added possible_net_t, from
Patrick McHardy.
2) Use LOGLEVEL_<FOO> from nf_log infrastructure, from Joe Perches.
3) Restore nf_log_trace that was accidentally removed during conflict
resolution.
4) nft_queue does not depend on NETFILTER_XTABLES, starting from here
all patches from Patrick McHardy.
5) Use raw_smp_processor_id() in nft_meta.
Then, several patches to prepare ground for the new set extension
infrastructure:
6) Pass object length to the hash callback in rhashtable as needed by
the new set extension infrastructure.
7) Cleanup patch to restore struct nft_hash as wrapper for struct
rhashtable
8) Another small source code readability cleanup for nft_hash.
9) Convert nft_hash to rhashtable callbacks.
And finally...
10) Add the new set extension infrastructure.
11) Convert the nft_hash and nft_rbtree sets to use it.
12) Batch set element release to avoid several RCU grace period in a row
and add new function nft_set_elem_destroy() to consolidate set element
release.
13) Return the set extension data area from nft_lookup.
14) Refactor existing transaction code to add some helper functions
and document it.
15) Complete the set transaction support, using similar approach to what we
already use, to activate/deactivate elements in an atomic fashion.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 29 Mar 2015 19:40:37 +0000 (12:40 -0700)]
Merge branch 'tipc-next'
Ying Xue says:
====================
tipc: fix two corner issues
The patch set aims at resolving the following two critical issues:
Patch #1: Resolve a deadlock which happens while all links are reset
Patch #2: Correct a mistake usage of RCU lock which is used to protect
node list
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Thu, 26 Mar 2015 10:10:24 +0000 (18:10 +0800)]
tipc: involve reference counter for node structure
TIPC node hash node table is protected with rcu lock on read side.
tipc_node_find() is used to look for a node object with node address
through iterating the hash node table. As the entire process of what
tipc_node_find() traverses the table is guarded with rcu read lock,
it's safe for us. However, when callers use the node object returned
by tipc_node_find(), there is no rcu read lock applied. Therefore,
this is absolutely unsafe for callers of tipc_node_find().
Now we introduce a reference counter for node structure. Before
tipc_node_find() returns node object to its caller, it first increases
the reference counter. Accordingly, after its caller used it up,
it decreases the counter again. This can prevent a node being used by
one thread from being freed by another thread.
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Reviewed-by: Jon Maloy <jon.maloy@ericson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The correct the sequence of grabbing n_ptr->lock and bclink->lock
should be that the former is first held and the latter is then taken,
which exactly happened on CPU1. But especially when the retransmission
of broadcast link is failed, bclink->lock is first held in
tipc_bclink_rcv(), and n_ptr->lock is taken in link_retransmit_failure()
called by tipc_link_retransmit() subsequently, which is demonstrated on
CPU0. As a result, deadlock occurs.
If the order of holding the two locks happening on CPU0 is reversed, the
deadlock risk will be relieved. Therefore, the node lock taken in
link_retransmit_failure() originally is moved to tipc_bclink_rcv()
so that it's obtained before bclink lock. But the precondition of
the adjustment of node lock is that responding to bclink reset event
must be moved from tipc_bclink_unlock() to tipc_node_unlock().
Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 29 Mar 2015 19:34:05 +0000 (12:34 -0700)]
Merge branch 'be2net-next'
Sathya Perla says:
====================
be2net: patch set
Hi David, this patch set includes 2 feature additions to the be2net driver:
Patch 1 sets up cpu affinity hints for be2net irqs using the
cpumask_set_cpu_local_first() API that first picks the near numa cores
and when they are exhausted, selects the far numa cores.
Patch 2 setups up xps queue mapping for be2net's TXQs to avoid,
by default, TX lock contention.
Patch 3 just bumps up the driver version.
Pls consider applying this patch set to the net-next queue. Thanks!
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Sathya Perla [Thu, 26 Mar 2015 07:05:09 +0000 (03:05 -0400)]
be2net: setup xps queue mapping
This patch sets up xps queue mapping on load, so that TX traffic is
steered to the queue whose irqs are being processed by the current cpu.
This helps in avoiding TX lock contention.
Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch provides hints to irqbalance to map be2net IRQs to
specific CPU cores. cpumask_set_cpu_local_first() is used, which first
maps IRQs to near NUMA cores; when those cores are exhausted, IRQs are
mapped to far NUMA cores.
Signed-off-by: Padmanabh Ratnakar <padmanabh.ratnakar@emulex.com> Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Wed, 25 Mar 2015 22:08:47 +0000 (15:08 -0700)]
tcp: tcp_syn_flood_action() can be static
After commit b717cd6f223e ("tcp: add tcp_conn_request"),
tcp_syn_flood_action() is no longer used from IPv6.
We can make it static, by moving it above tcp_conn_request()
Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Mar 2015 21:26:21 +0000 (14:26 -0700)]
Merge branch 'bcmgenet-next'
Petri Gynther says:
====================
net: bcmgenet: multiple Rx queues support
Final patch set to add support for multiple Rx queues:
1. remove priv->int0_mask and priv->int1_mask
2. modify Tx ring int_enable and int_disable vectors
3. simplify bcmgenet_init_dma()
4. tweak init_umac()
5. rework Tx NAPI code
6. rework Rx NAPI code
7. add support for multiple Rx queues
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:16 +0000 (12:35 -0700)]
net: bcmgenet: add support for multiple Rx queues
Add support for multiple Rx queues:
1. Add NAPI context per Rx queue
2. Modify Rx interrupt and Rx NAPI code to handle multiple Rx queues
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:15 +0000 (12:35 -0700)]
net: bcmgenet: rework Rx NAPI code
Introduce new bcmgenet functions to handle the NAPI calls to:
netif_napi_add()
napi_enable()
napi_disable()
netif_napi_del()
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:14 +0000 (12:35 -0700)]
net: bcmgenet: rework Tx NAPI code
Introduce new bcmgenet functions to handle the NAPI calls to:
netif_napi_add()
napi_enable()
napi_disable()
netif_napi_del()
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:12 +0000 (12:35 -0700)]
net: bcmgenet: tweak init_umac()
Use more meaningful variable names int0_enable and int1_enable when
enabling bcmgenet interrupts.
For Rx default queue interrupts, use:
UMAC_IRQ_RXDMA_BDONE | UMAC_IRQ_RXDMA_PDONE
For Tx default queue interrupts, use:
UMAC_IRQ_TXDMA_BDONE | UMAC_IRQ_TXDMA_PDONE
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:11 +0000 (12:35 -0700)]
net: bcmgenet: simplify bcmgenet_init_dma()
Do the two kcalloc() calls first, before proceeding into Rx/Tx DMA init.
Makes the error case handling much simpler.
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Jaedon Shin <jaedon.shin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:10 +0000 (12:35 -0700)]
net: bcmgenet: modify Tx ring int_enable and int_disable vectors
Remove unnecessary function parameter priv. Use ring->priv instead.
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Petri Gynther [Wed, 25 Mar 2015 19:35:09 +0000 (12:35 -0700)]
net: bcmgenet: remove priv->int0_mask and priv->int1_mask
Remove unused priv->int0_mask and priv->int1_mask.
Signed-off-by: Petri Gynther <pgynther@google.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Fri, 27 Mar 2015 21:18:52 +0000 (14:18 -0700)]
Merge branch 'xgene-next'
Iyappan Subramanian says:
====================
drivers: net: xgene: Add separate tx completion ring
SGMII based 1GbE and 10GbE interfaces support multiple interrupts.
Adding separate tx completion descriptor ring and associating a dedicated irq for the TX completion.
====================
drivers: net: xgene: Add separate tx completion ring
- Added wrapper functions around napi_add, napi_del, napi_enable and napi_disable
- Moved platform_get_irq function call after reading phy_mode
- Associating the new irq to tx completion for the supported ethernet interfaces
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Documentation: dts: xgene: Update interrupt field description
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com> Signed-off-by: Keyur Chudgar <kchudgar@apm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Wed, 25 Mar 2015 14:08:50 +0000 (14:08 +0000)]
netfilter: nf_tables: implement set transaction support
Set elements are the last object type not supporting transaction support.
Implement similar to the existing rule transactions:
The global transaction counter keeps track of two generations, current
and next. Each element contains a bitmask specifying in which generations
it is inactive.
New elements start out as inactive in the current generation and active
in the next. On commit, the previous next generation becomes the current
generation and the element becomes active. The bitmask is then cleared
to indicate that the element is active in all future generations. If the
transaction is aborted, the element is removed from the set before it
becomes active.
When removing an element, it gets marked as inactive in the next generation.
On commit the next generation becomes active and the therefor the element
inactive. It is then taken out of then set and released. On abort, the
element is marked as active for the next generation again.
Lookups ignore elements not active in the current generation.
The current set types (hash/rbtree) both use a field in the extension area
to store the generation mask. This (currently) does not require any
additional memory since we have some free space in there.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Wed, 25 Mar 2015 14:08:47 +0000 (14:08 +0000)]
netfilter: nf_tables: consolide set element destruction
With the conversion to set extensions, it is now possible to consolidate
the different set element destruction functions.
The set implementations' ->remove() functions are changed to only take
the element out of their internal data structures. Elements will be freed
in a batched fashion after the global transaction's completion RCU grace
period.
This reduces the amount of grace periods required for nft_hash from N
to zero additional ones, additionally this guarantees that the set
elements' extensions of all implementations can be used under RCU
protection.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
ipv6: hash net ptr into fragmentation bucket selection
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com> Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv4: hash net ptr into fragmentation bucket selection
As namespaces are sometimes used with overlapping ip address ranges,
we should also use the namespace as input to the hash to select the ip
fragmentation counter bucket.
Cc: Eric Dumazet <edumazet@google.com> Cc: Flavio Leitner <fbl@redhat.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 25 Mar 2015 18:05:56 +0000 (14:05 -0400)]
Merge branch 'tipc-next'
Jon Maloy says:
====================
tipc: some improvements and fixes
We introduce a better algorithm for selecting when and which
users should be subject to link congestion control, plus clean
up some code for that mechanism.
Commit #3 fixes another rare race condition during packet reception.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 25 Mar 2015 16:07:26 +0000 (12:07 -0400)]
tipc: eliminate race condition at dual link establishment
Despite recent improvements, the establishment of dual parallel
links still has a small glitch where messages can bypass each
other. When the second link in a dual-link configuration is
established, part of the first link's traffic will be steered over
to the new link. Although we do have a mechanism to ensure that
packets sent before and after the establishment of the new link
arrive in sequence to the destination node, this is not enough.
The arriving messages will still be delivered upwards in different
threads, something entailing a risk of message disordering during
the transition phase.
To fix this, we introduce a synchronization mechanism between the
two parallel links, so that traffic arriving on the new link cannot
be added to its input queue until we are guaranteed that all
pre-establishment messages have been delivered on the old, parallel
link.
This problem seems to always have been around, but its occurrence is
so rare that it has not been noticed until recent intensive testing.
Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 25 Mar 2015 16:07:25 +0000 (12:07 -0400)]
tipc: clean up handling of link congestion
After the recent changes in message importance handling it becomes
possible to simplify handling of messages and sockets when we
encounter link congestion.
We merge the function tipc_link_cong() into link_schedule_user(),
and simplify the code of the latter. The code should now be
easier to follow, especially regarding return codes and handling
of the message that caused the situation.
In case the scheduling function is unable to pre-allocate a wakeup
message buffer, it now returns -ENOBUFS, which is a more correct
code than the previously used -EHOSTUNREACH.
Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Jon Paul Maloy [Wed, 25 Mar 2015 16:07:24 +0000 (12:07 -0400)]
tipc: introduce starvation free send algorithm
Currently, we only use a single counter; the length of the backlog
queue, to determine whether a message should be accepted to the queue
or not. Each time a message is being sent, the queue length is compared
to a threshold value for the message's importance priority. If the queue
length is beyond this threshold, the message is rejected. This algorithm
implies a risk of starvation of low importance senders during very high
load, because it may take a long time before the backlog queue has
decreased enough to accept a lower level message.
We now eliminate this risk by introducing a counter for each importance
priority. When a message is sent, we check only the queue level for that
particular message's priority. If that is ok, the message can be added
to the backlog, irrespective of the queue level for other priorities.
This way, each level is guaranteed a certain portion of the total
bandwidth, and any risk of starvation is eliminated.
Reviewed-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Guenter Roeck [Wed, 25 Mar 2015 15:08:37 +0000 (08:08 -0700)]
net: dsa: Handle non-bridge master change
Master change notifications may occur other than when joining or
leaving a bridge, for example when being added to or removed from
a bond or Open vSwitch. In that case, do nothing instead of asking
the switch driver to remove a port from a bridge that it didn't join.
Signed-off-by: Guenter Roeck <linux@roeck-us.net> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Wed, 25 Mar 2015 13:07:50 +0000 (13:07 +0000)]
netfilter: nf_tables: convert hash and rbtree to set extensions
The set implementations' private struct will only contain the elements
needed to maintain the search structure, all other elements are moved
to the set extensions.
Element allocation and initialization is performed centrally by
nf_tables_api instead of by the different set implementations'
->insert() functions. A new "elemsize" member in the set ops specifies
the amount of memory to reserve for internal usage. Destruction
will also be moved out of the set implementations by a following patch.
Except for element allocation, the patch is a simple conversion to
using data from the extension area.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Wed, 25 Mar 2015 13:07:48 +0000 (13:07 +0000)]
netfilter: nft_hash: convert to use rhashtable callbacks
A following patch will convert sets to use so called set extensions,
where the key is not located in a fixed position anymore. This will
require rhashtable hashing and comparison callbacks to be used.
As preparation, convert nft_hash to use these callbacks without any
functional changes.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Patrick McHardy [Wed, 25 Mar 2015 13:07:45 +0000 (13:07 +0000)]
rhashtable: provide len to obj_hashfn
nftables sets will be converted to use so called setextensions, moving
the key to a non-fixed position. To hash it, the obj_hashfn must be used,
however it so far doesn't receive the length parameter.
Pass the key length to obj_hashfn() and convert existing users.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
crypto: algif - fix warn: unsigned 'used' is never less than zero
Change type from unsigned long to int to fix an issue reported by kbuild robot:
crypto/algif_skcipher.c:596 skcipher_recvmsg_async() warn: unsigned 'used' is
never less than zero.
Signed-off-by: Tadeusz Struk <tadeusz.struk@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Wed, 25 Mar 2015 10:09:40 +0000 (18:09 +0800)]
tipc: fix a link reset issue due to retransmission failures
When a node joins a cluster while we are transmitting a fragment
stream over the broadcast link, it's missing the preceding fragments
needed to build a meaningful message. As a result, the node has to
drop it. However, as the fragment message is not acknowledged to
its sender before it's dropped, it accidentally causes link reset
of retransmission failure on the node.
Reported-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Reviewed-by: Erik Hugne <erik.hugne@ericsson.com> Tested-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sebastian Ott [Wed, 25 Mar 2015 15:31:37 +0000 (16:31 +0100)]
s390: fix /proc/interrupts output
The irqclass_sub_desc array and enum interruption_class are out of sync
thus /proc/interrupts is broken. Remove IRQIO_CLW.
Signed-off-by: Sebastian Ott <sebott@linux.vnet.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Patrick McHardy [Wed, 25 Mar 2015 08:09:56 +0000 (08:09 +0000)]
netfilter: nft_meta: use raw_smp_processor_id()
Using smp_processor_id() triggers warnings with PREEMPT_RCU. There is no
point in disabling preemption since we only collect the numeric value,
so use raw_smp_processor_id() instead.
Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
netfilter: nf_tables: restore nf_log_trace() in nf_tables_core.c
As described by 8787a81 ("netfilter: restore rule tracing via
nfnetlink_log"), this accidentally slipped through during conflict
resolution in 0b4fe07.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Joe Perches [Mon, 23 Mar 2015 18:50:10 +0000 (11:50 -0700)]
netfilter: Use LOGLEVEL_<FOO> defines
Use the #defines where appropriate.
Miscellanea:
Add explicit #include <linux/kernel.h> where it was not
previously used so that these #defines are a bit more
explicitly defined instead of indirectly included via:
module.h->moduleparam.h->kernel.h
Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
We should init ireq->ireq_family based on listener sk_family,
not the actual protocol carried by SYN packet.
This means we can set ireq_family in inet_reqsk_alloc()
Fixes: 72f329329ce1 ("inet: introduce ireq_family") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
The original comment was not really informative or funny
as well as sexist. Replace it with a better explanation of
why the driver does stop and what the impacts are.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Wed, 25 Mar 2015 01:16:30 +0000 (21:16 -0400)]
Merge branch 'listener_refactor_16'
Eric Dumazet says:
====================
tcp: listener refactor part 16
A CONFIG_PROVE_RCU=y build revealed an RCU splat I had to fix.
I added const qualifiers to various md5 methods, as I expect
to call them on behalf of request sock traffic even if
the listener socket is not locked. This seems ok, but adding
const makes the contract clearer. Note a good reduction
of code size thanks to request/establish sockets convergence.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 24 Mar 2015 22:58:52 +0000 (15:58 -0700)]
tcp: md5: fix rcu lockdep splat
While timer handler effectively runs a rcu read locked section,
there is no explicit rcu_read_lock()/rcu_read_unlock() annotations
and lockdep can be confused here :
net/ipv4/tcp_ipv4.c-906- /* caller either holds rcu_read_lock() or socket lock */
net/ipv4/tcp_ipv4.c:907: md5sig = rcu_dereference_check(tp->md5sig_info,
net/ipv4/tcp_ipv4.c-908- sock_owned_by_user(sk) ||
net/ipv4/tcp_ipv4.c-909- lockdep_is_held(&sk->sk_lock.slock));
Let's explicitely acquire rcu_read_lock() in tcp_make_synack()
Before commit 95313fcb2b6 ("inet: get rid of central tcp/dccp listener
timer"), we were holding listener lock so lockdep was happy.
Fixes: 95313fcb2b6 ("inet: get rid of central tcp/dccp listener timer") Signed-off-by: Eric DUmazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 24 Mar 2015 21:53:06 +0000 (17:53 -0400)]
Merge branch 'rhashtable-next'
Thomas Graf says:
====================
rhashtable updates on top of Herbert's work
Patch 1 is a bugfix for an RCU splash I encountered while testing.
Patch 2 & 3 are pure cleanups. Patch 4 disables automatic shrinking
by default as discussed in previous thread. Patch 5 removes some
rhashtable internal knowledge from nft_hash and fixes another RCU
splash.
I've pushed various rhashtable tests (Netlink, nft) together with a
Makefile to a git tree [0] for easier stress testing.
Thomas Graf [Tue, 24 Mar 2015 13:18:20 +0000 (14:18 +0100)]
rhashtable: Add rhashtable_free_and_destroy()
rhashtable_destroy() variant which stops rehashes, iterates over
the table and calls a callback to release resources.
Avoids need for nft_hash to embed rhashtable internals and allows to
get rid of the being_destroyed flag. It also saves a 2nd mutex
lock upon destruction.
Also fixes an RCU lockdep splash on nft set destruction due to
calling rht_for_each_entry_safe() without holding bucket locks.
Open code this loop as we need know that no mutations may occur in
parallel.
Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Thomas Graf [Tue, 24 Mar 2015 13:18:16 +0000 (14:18 +0100)]
rhashtable: Extend RCU read lock into rhashtable_insert_rehash()
rhashtable_insert_rehash() requires RCU locks to be held in order
to access ht->tbl and traverse to the last table.
Fixes: 43719bf4351e ("rhashtable: Add immediate rehash during insertion") Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Sekletar [Tue, 24 Mar 2015 13:48:41 +0000 (14:48 +0100)]
filter: introduce SKF_AD_VLAN_TPID BPF extension
If vlan offloading takes place then vlan header is removed from frame
and its contents, both vlan_tci and vlan_proto, is available to user
space via TPACKET interface. However, only vlan_tci can be used in BPF
filters.
This commit introduces a new BPF extension. It makes possible to load
the value of vlan_proto (vlan TPID) to register A. Support for classic
BPF and eBPF is being added, analogous to skb->protocol.
Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Alexei Starovoitov <ast@plumgrid.com> Cc: Jiri Pirko <jpirko@redhat.com> Signed-off-by: Michal Sekletar <msekleta@redhat.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@plumgrid.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: fix sparse warnings in privacy stable addresses generation
Those warnings reported by sparse endianness check (via kbuild test robot)
are harmless, nevertheless fix them up and make the code a little bit
easier to read.
Reported-by: kbuild test robot <fengguang.wu@intel.com> Fixes: 576441be3c915c0 ("ipv6: generation of stable privacy addresses for link-local and autoconf") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Ying Xue [Tue, 24 Mar 2015 08:59:21 +0000 (16:59 +0800)]
tipc: fix compile error when IPV6=m and TIPC=y
When IPV6=m and TIPC=y, below error will appear during building kernel
image:
net/tipc/udp_media.c:196:
undefined reference to `ip6_dst_lookup'
make: *** [vmlinux] Error 1
As ip6_dst_lookup() is implemented in IPV6 and IPV6 is compiled as
module, ip6_dst_lookup() is not built-in core kernel image. As a
result, compiler cannot find 'ip6_dst_lookup' reference while
compiling TIPC code into core kernel image.
But with the method introduced by commit d22e5150283a ("ipv6: export a
stub for IPv6 symbols used by vxlan"), we can avoid the compile error
through "ipv6_stub" pointer to access ip6_dst_lookup().
Fixes: 38e17eaa07d7 ("tipc: add ip/udp media type") Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
WANG Cong [Tue, 24 Mar 2015 18:53:31 +0000 (11:53 -0700)]
net: allow to delete a whole device group
With dev group, we can change a batch of net devices,
so we should allow to delete them together too.
Group 0 is not allowed to be deleted since it is
the default group.
Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Sasha Levin [Mon, 23 Mar 2015 19:30:00 +0000 (15:30 -0400)]
tipc: validate length of sockaddr in connect() for dgram/rdm
Commit 5f94e5c ("tipc: add support for connect() on dgram/rdm sockets")
hasn't validated user input length for the sockaddr structure which allows
a user to overwrite kernel memory with arbitrary input.
Fixes: 5f94e5c ("tipc: add support for connect() on dgram/rdm sockets") Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: remove never used forwarding_accel_ops pointer from net_device
Cc: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Mon, 23 Mar 2015 22:53:17 +0000 (09:53 +1100)]
rhashtable: Fix sleeping inside RCU critical section in walk_stop
The commit 0062d19dc7de879c3c1033f6ea970a6f8e8e8a1f ("rhashtable:
Fix use-after-free in rhashtable_walk_stop") fixed a real bug
but created another one because we may end up sleeping inside an
RCU critical section.
This patch fixes it properly by replacing the mutex with a spin
lock that specifically protects the walker lists.
Reported-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: add documentation for stable_secret, idgen_delay and idgen_retries knobs
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
ipv6: introduce idgen_delay and idgen_retries knobs
This is specified by RFC 7217.
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
If a DAD conflict is detected, we want to retry privacy stable address
generation up to idgen_retries (= 3) times with a delay of idgen_delay
(= 1 second). Add the logic to addrconf_dad_failure.
By design, we don't clean up dad failed permanent addresses.
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
We need to mark appropriate addresses so we can do retries in case their
DAD failed.
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
is the RID (random identifier). As the hash function F we chose one
round of sha1. Prefix will be either the link-local prefix or the
router advertised one. As Net_Iface we use the MAC address of the
device. DAD_Counter and secret_key are implemented as specified.
We don't use Network_ID, as it couples the code too closely to other
subsystems. It is specified as optional in the RFC.
As Net_Iface we only use the MAC address: we simply have no stable
identifier in the kernel we could possibly use: because this code might
run very early, we cannot depend on names, as they might be changed by
user space early on during the boot process.
A new address generation mode is introduced,
IN6_ADDR_GEN_MODE_STABLE_PRIVACY. With iproute2 one can switch back to
none or eui64 address configuration mode although the stable_secret is
already set.
We refuse writes to ipv6/conf/all/stable_secret but only allow
ipv6/conf/default/stable_secret and the interface specific file to be
written to. The default stable_secret is used as the parameter for the
namespace, the interface specific can overwrite the secret, e.g. when
switching a network configuration from one system to another while
inheriting the secret.
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch implements the procfs logic for the stable_address knob:
The secret is formatted as an ipv6 address and will be stored per
interface and per namespace. We track initialized flag and return EIO
errors until the secret is set.
We don't inherit the secret to newly created namespaces.
Cc: Erik Kline <ek@google.com> Cc: Fernando Gont <fgont@si6networks.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: YOSHIFUJI Hideaki/吉藤英明 <hideaki.yoshifuji@miraclelinux.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Tue, 24 Mar 2015 02:10:50 +0000 (22:10 -0400)]
Merge branch 'bcmgenet-next'
Florian Fainelli says:
====================
net: bcmgenet: integrated GPHY power up/down
This patch series implements integrated Gigabit PHY power up/down, which allows
us to save close to 300mW on some designs when the Gigabit PHY is known to be
unused (e.g: during bcmgenet_close or bcmgenet_suspend not doing Wake-on-LAN).
Changes in v2:
- drop an extra bcmgenet_ext_readl in bcmgenet_phy_power_set
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 23 Mar 2015 22:09:57 +0000 (15:09 -0700)]
net: bcmgenet: power down and up GPHY during suspend/resume
In case the interface is not used, power down the integrated GPHY during
suspend. Similarly to bcmgenet_open(), bcmgenet_resume() powers on the GPHY
prior to any UniMAC activity.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 23 Mar 2015 22:09:56 +0000 (15:09 -0700)]
net: bcmgenet: power up and down integrated GPHY when unused
Power up the GPHY while we are bringing-up the network interface, and
conversely, upon bring down, power the GPHY down. In order to avoid
creating hardware hazards, make sure that the GPHY gets powered on
during bcmgenet_open() prior to the UniMAC being reset as the UniMAC may
start creating activity towards the GPHY if we reverse the steps.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 23 Mar 2015 22:09:53 +0000 (15:09 -0700)]
net: bcmgenet: rename bcmgenet_ephy_power_up
In preparation for implementing the power down GPHY sequence, rename
bcmgenet_ephy_power_up to illustrate that it is not EPHY specific but
PHY agnostic, and add an "enable" argument.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 23 Mar 2015 22:09:52 +0000 (15:09 -0700)]
net: bcmgenet: update bcmgenet_ephy_power_up to clear CK25_DIS bit
The CK25_DIS bit controls whether a 25Mhz clock is fed to the GPHY or
not, in preparation for powering down the integrated GPHY when relevant,
make sure we clear that bit.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Recall that the original implementation in br_multicast used
two list pointers per hash node and therefore is limited to at
most one rehash at a time since you need one list pointer for
the old table and one for the new table.
Thanks to Josh Triplett's suggestion of using a single list pointer
we're no longer limited by that. So it is perfectly OK to have
an arbitrary number of tables in existence at any one time.
The reader and removal simply has to walk from the oldest table
to the newest table in order not to miss anything. Insertion
without lookup are just as easy as we simply go to the last table
that we can find and add the entry there.
However, insertion with uniqueness lookup is more complicated
because we need to ensure that two simultaneous insertions of the
same key do not both succeed. To achieve this, all insertions
including those without lookups are required to obtain the bucket
lock from the oldest hash table that is still alive. This is
determined by having the rehasher (there is only one rehashing
thread in the system) keep a pointer of where it is up to. If
a bucket has already been rehashed then it is dead, i.e., there
cannot be any more insertions to it, otherwise it is considered
alive. This guarantees that the same key cannot be inserted
in two different tables in parallel.
Patch 1 is actually a bug fix for the walker.
Patch 2-5 eliminates unnecessary out-of-line copies of jhash.
Patch 6 makes rhashtable_shrink shrink to fit.
Patch 7 introduces multiple rehashing. This means that if we
decide to grow then we will grow regardless of whether the previous
one has finished. However, this is still asynchronous meaning
that if insertions come fast enough we may still end up with a
table that is overutilised.
Patch 8 adds support for GFP_ATOMIC allocations of struct bucket_table.
Finally patch 9 enables immediate rehashing. This is done either
when the table reaches 100% utilisation, or when the chain length
exceeds 16 (the latter can be disabled on request, e.g., for
nft_hash.
With these patches the system should no longer have any trouble
dealing with fast insertions on a small table. In the worst
case you end up with a list of tables that's log N in length
while the rehasher catches up.
v3 restores rhashtable_shrink and fixes a number of bugs in the
multiple rehashing patches (7 and 9).
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Mon, 23 Mar 2015 13:50:28 +0000 (00:50 +1100)]
rhashtable: Add immediate rehash during insertion
This patch reintroduces immediate rehash during insertion. If
we find during insertion that the table is full or the chain
length exceeds a set limit (currently 16 but may be disabled
with insecure_elasticity) then we will force an immediate rehash.
The rehash will contain an expansion if the table utilisation
exceeds 75%.
If this rehash fails then the insertion will fail. Otherwise the
insertion will be reattempted in the new hash table.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
This patch adds the ability to allocate bucket table with GFP_ATOMIC
instead of GFP_KERNEL. This is needed when we perform an immediate
rehash during insertion.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Acked-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>