kernel.git
6 years agonet: sched: add helper function to take reference to Qdisc
Vlad Buslov [Mon, 24 Sep 2018 16:22:52 +0000 (19:22 +0300)]
net: sched: add helper function to take reference to Qdisc

Implement function to take reference to Qdisc that relies on rcu read lock
instead of rtnl mutex. Function only takes reference to Qdisc if reference
counter isn't zero. Intended to be used by unlocked cls API.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: extend Qdisc with rcu
Vlad Buslov [Mon, 24 Sep 2018 16:22:51 +0000 (19:22 +0300)]
net: sched: extend Qdisc with rcu

Currently, Qdisc API functions assume that users have rtnl lock taken. To
implement rtnl unlocked classifiers update interface, Qdisc API must be
extended with functions that do not require rtnl lock.

Extend Qdisc structure with rcu. Implement special version of put function
qdisc_put_unlocked() that is called without rtnl lock taken. This function
only takes rtnl lock if Qdisc reference counter reached zero and is
intended to be used as optimization.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: sched: rename qdisc_destroy() to qdisc_put()
Vlad Buslov [Mon, 24 Sep 2018 16:22:50 +0000 (19:22 +0300)]
net: sched: rename qdisc_destroy() to qdisc_put()

Current implementation of qdisc_destroy() decrements Qdisc reference
counter and only actually destroy Qdisc if reference counter value reached
zero. Rename qdisc_destroy() to qdisc_put() in order for it to better
describe the way in which this function currently implemented and used.

Extract code that deallocates Qdisc into new private qdisc_destroy()
function. It is intended to be shared between regular qdisc_put() and its
unlocked version that is introduced in next patch in this series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: core: netlink: add helper refcount dec and lock function
Vlad Buslov [Mon, 24 Sep 2018 16:22:49 +0000 (19:22 +0300)]
net: core: netlink: add helper refcount dec and lock function

Rtnl lock is encapsulated in netlink and cannot be accessed by other
modules directly. This means that reference counted objects that rely on
rtnl lock cannot use it with refcounter helper function that atomically
releases decrements reference and obtains mutex.

This patch implements simple wrapper function around refcount_dec_and_lock
that obtains rtnl lock if reference counter value reached 0.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotls: Fixed a memory leak during socket close
Vakul Garg [Tue, 25 Sep 2018 14:51:51 +0000 (20:21 +0530)]
tls: Fixed a memory leak during socket close

During socket close, if there is a open record with tx context, it needs
to be be freed apart from freeing up plaintext and encrypted scatter
lists. This patch frees up the open record if present in tx context.

Also tls_free_both_sg() has been renamed to tls_free_open_rec() to
indicate that the free record in tx context is being freed inside the
function.

Fixes: b574142da9d5 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotls: Fix socket mem accounting error under async encryption
Vakul Garg [Tue, 25 Sep 2018 10:56:17 +0000 (16:26 +0530)]
tls: Fix socket mem accounting error under async encryption

Current async encryption implementation sometimes showed up socket
memory accounting error during socket close. This results in kernel
warning calltrace. The root cause of the problem is that socket var
sk_forward_alloc gets corrupted due to access in sk_mem_charge()
and sk_mem_uncharge() being invoked from multiple concurrent contexts
in multicore processor. The apis sk_mem_charge() and sk_mem_uncharge()
are called from functions alloc_plaintext_sg(), free_sg() etc. It is
required that memory accounting apis are called under a socket lock.

The plaintext sg data sent for encryption is freed using free_sg() in
tls_encryption_done(). It is wrong to call free_sg() from this function.
This is because this function may run in irq context. We cannot acquire
socket lock in this function.

We remove calling of function free_sg() for plaintext data from
tls_encryption_done() and defer freeing up of plaintext data to the time
when the record is picked up from tx_list and transmitted/freed. When
tls_tx_records() gets called, socket is already locked and thus there is
no concurrent access problem.

Fixes: b574142da9d5 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
David S. Miller [Tue, 25 Sep 2018 17:35:29 +0000 (10:35 -0700)]
Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net

Version bump conflict in batman-adv, take what's in net-next.

iavf conflict, adjustment of netdev_ops in net-next conflicting
with poll controller method removal in net.

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: macb: Clean 64b dma addresses if they are not detected
Michal Simek [Tue, 25 Sep 2018 06:32:50 +0000 (08:32 +0200)]
net: macb: Clean 64b dma addresses if they are not detected

Clear ADDR64 dma bit in DMACFG register in case that HW_DMA_CAP_64B is
not detected on 64bit system.
The issue was observed when bootloader(u-boot) does not check macb
feature at DCFG6 register (DAW64_OFFSET) and enabling 64bit dma support
by default. Then macb driver is reading DMACFG register back and only
adding 64bit dma configuration but not cleaning it out.

Signed-off-by: Michal Simek <michal.simek@xilinx.com>
Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'r8169-series-with-smaller-improvements'
David S. Miller [Tue, 25 Sep 2018 17:28:03 +0000 (10:28 -0700)]
Merge branch 'r8169-series-with-smaller-improvements'

Heiner Kallweit says:

====================
r8169: series with smaller improvements

This series includes smaller improvements, nothing exciting.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agor8169: improve a check in rtl_init_one
Heiner Kallweit [Tue, 25 Sep 2018 05:59:36 +0000 (07:59 +0200)]
r8169: improve a check in rtl_init_one

The check for pci_is_pcie() is redundant here because all
chip versions >=18 are PCIe only anyway. In addition use
dma_set_mask_and_coherent() instead of separate calls to
pci_set_dma_mask() and pci_set_consistent_dma_mask().

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agor8169: improve rtl8169_irq_mask_and_ack
Heiner Kallweit [Tue, 25 Sep 2018 05:58:00 +0000 (07:58 +0200)]
r8169: improve rtl8169_irq_mask_and_ack

Code can be slightly simplified by acking even events we're not
interested in. In addition add a comment making clear that the
read has no functional purpose and is just a PCI commit.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agor8169: use default watchdog timeout
Heiner Kallweit [Tue, 25 Sep 2018 05:56:53 +0000 (07:56 +0200)]
r8169: use default watchdog timeout

The networking core has a default watchdog timeout of 5s. I see no
need to define an own timeout of 6s which is basically the same.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Greg Kroah-Hartman [Tue, 25 Sep 2018 16:14:14 +0000 (18:14 +0200)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

James writes:
  "SCSI fixes on 20180925

   Nine obvious bug fixes mostly in individual drivers.  The target fix
   is of particular importance because it's CVE related."

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  scsi: sd: don't crash the host on invalid commands
  scsi: ipr: System hung while dlpar adding primary ipr adapter back
  scsi: target: iscsi: Use bin2hex instead of a re-implementation
  scsi: target: iscsi: Use hex2bin instead of a re-implementation
  scsi: lpfc: Synchronize access to remoteport via rport
  scsi: ufs: Disable blk-mq for now
  scsi: sd: Contribute to randomness when running rotational device
  scsi: ibmvscsis: Ensure partition name is properly NUL terminated
  scsi: ibmvscsis: Fix a stringop-overflow warning

6 years agoMerge tag 'usb-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Greg Kroah-Hartman [Tue, 25 Sep 2018 15:54:55 +0000 (17:54 +0200)]
Merge tag 'usb-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb

I wrote:
  "USB fixes for 4.19-rc6

   Here are some small USB core and driver fixes for reported issues for
   4.19-rc6.

   The most visible is the oops fix for when the USB core is built into the
   kernel that is present in 4.18.  Turns out not many people actually do
   that so it went unnoticed for a while.  The rest is some tiny typec,
   musb, and other core fixes.

   All have been in linux-next with no reported issues."

* tag 'usb-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
  usb: typec: mux: Take care of driver module reference counting
  usb: core: safely deal with the dynamic quirk lists
  usb: roles: Take care of driver module reference counting
  USB: handle NULL config in usb_find_alt_setting()
  USB: fix error handling in usb_driver_claim_interface()
  USB: remove LPM management from usb_driver_claim_interface()
  USB: usbdevfs: restore warning for nonsensical flags
  USB: usbdevfs: sanitize flags more
  Revert "usb: cdc-wdm: Fix a sleep-in-atomic-context bug in service_outstanding_interrupt()"
  usb: musb: dsps: do not disable CPPI41 irq in driver teardown

6 years agoMerge tag 'tty-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Greg Kroah-Hartman [Tue, 25 Sep 2018 15:22:50 +0000 (17:22 +0200)]
Merge tag 'tty-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty

I wrote:
  "TTY/Serial driver fixes for 4.19-rc6

   Here are a number of small tty and serial driver fixes for reported
   issues for 4.19-rc6.

   One should hopefully resolve a much-reported issue that syzbot has found
   in the tty layer.  Although there are still more issues there, getting
   this fixed is nice to see finally happen.

   All of these have been in linux-next for a while with no reported
   issues."

* tag 'tty-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
  serial: imx: restore handshaking irq for imx1
  tty: vt_ioctl: fix potential Spectre v1
  tty: Drop tty->count on tty_reopen() failure
  serial: cpm_uart: return immediately from console poll
  tty: serial: lpuart: avoid leaking struct tty_struct
  serial: mvebu-uart: Fix reporting of effective CSIZE to userspace

6 years agoMerge tag 'char-misc-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregk...
Greg Kroah-Hartman [Tue, 25 Sep 2018 15:01:52 +0000 (17:01 +0200)]
Merge tag 'char-misc-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc

Greg (well I), wrote:
  "Char/Misc driver fixes for 4.19-rc6

   Here are some soundwire and intel_th (tracing) driver fixes for some
   reported issues.

   All of these have been in linux-next for a week with no reported issues."

* tag 'char-misc-4.19-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
  intel_th: pci: Add Ice Lake PCH support
  intel_th: Fix resource handling for ACPI glue layer
  intel_th: Fix device removal logic
  soundwire: Fix acquiring bus lock twice during master release
  soundwire: Fix incorrect exit after configuring stream
  soundwire: Fix duplicate stream state assignment

6 years agoRevert "uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name"
Lubomir Rintel [Mon, 24 Sep 2018 12:18:34 +0000 (13:18 +0100)]
Revert "uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name"

This changes UAPI, breaking iwd and libell:

  ell/key.c: In function 'kernel_dh_compute':
  ell/key.c:205:38: error: 'struct keyctl_dh_params' has no member named 'private'; did you mean 'dh_private'?
    struct keyctl_dh_params params = { .private = private,
                                        ^~~~~~~
                                        dh_private

This reverts commit 76542264ef964cf817131bdd616b5d519ec2d0f7.

Fixes: 76542264ef96 ("uapi/linux/keyctl.h: don't use C++ reserved keyword as a struct member name")
Signed-off-by: Lubomir Rintel <lkundrak@v3.sk>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Randy Dunlap <rdunlap@infradead.org>
cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
cc: Stephan Mueller <smueller@chronox.de>
cc: James Morris <jmorris@namei.org>
cc: "Serge E. Hallyn" <serge@hallyn.com>
cc: Mat Martineau <mathew.j.martineau@linux.intel.com>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Linus Torvalds <torvalds@linux-foundation.org>
cc: <stable@vger.kernel.org>
Signed-off-by: James Morris <james.morris@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
6 years agoMerge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net
Greg Kroah-Hartman [Tue, 25 Sep 2018 09:19:28 +0000 (11:19 +0200)]
Merge gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net

Dave writes:
  "Networking fixes:

  1) Fix multiqueue handling of coalesce timer in stmmac, from Jose
     Abreu.

   2) Fix memory corruption in NFC, from Suren Baghdasaryan.

   3) Don't write reserved bits in ravb driver, from Kazuya Mizuguchi.

   4) SMC bug fixes from Karsten Graul, YueHaibing, and Ursula Braun.

   5) Fix TX done race in mvpp2, from Antoine Tenart.

   6) ipv6 metrics leak, from Wei Wang.

   7) Adjust firmware version requirements in mlxsw, from Petr Machata.

   8) Fix autonegotiation on resume in r8169, from Heiner Kallweit.

   9) Fixed missing entries when dumping /proc/net/if_inet6, from Jeff
      Barnhill.

   10) Fix double free in devlink, from Dan Carpenter.

   11) Fix ethtool regression from UFO feature removal, from Maciej
       Żenczykowski.

   12) Fix drivers that have a ndo_poll_controller() that captures the
       cpu entirely on loaded hosts by trying to drain all rx and tx
       queues, from Eric Dumazet.

   13) Fix memory corruption with jumbo frames in aquantia driver, from
       Friedemann Gerold."

* gitolite.kernel.org:/pub/scm/linux/kernel/git/davem/net: (79 commits)
  net: mvneta: fix the remaining Rx descriptor unmapping issues
  ip_tunnel: be careful when accessing the inner header
  mpls: allow routes on ip6gre devices
  net: aquantia: memory corruption on jumbo frames
  tun: remove ndo_poll_controller
  nfp: remove ndo_poll_controller
  bnxt: remove ndo_poll_controller
  bnx2x: remove ndo_poll_controller
  mlx5: remove ndo_poll_controller
  mlx4: remove ndo_poll_controller
  i40evf: remove ndo_poll_controller
  ice: remove ndo_poll_controller
  igb: remove ndo_poll_controller
  ixgb: remove ndo_poll_controller
  fm10k: remove ndo_poll_controller
  ixgbevf: remove ndo_poll_controller
  ixgbe: remove ndo_poll_controller
  bonding: use netpoll_poll_dev() helper
  netpoll: make ndo_poll_controller() optional
  rds: Fix build regression.
  ...

6 years agodpaa2-eth: Make Rx flow hash key configurable
Ioana Ciocoi Radulescu [Mon, 24 Sep 2018 15:36:21 +0000 (15:36 +0000)]
dpaa2-eth: Make Rx flow hash key configurable

Until now, the Rx flow hash key was a 5-tuple (IP src, IP dst,
IP nextproto, L4 src port, L4 dst port) fixed value that we
configured at probe.

Add support for configuring this hash key at runtime.
We support all standard header fields configurable through ethtool,
but cannot differentiate between flow types, so the same hash key
is applied regardless of protocol.

We also don't support the discard option.

Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mvneta: fix the remaining Rx descriptor unmapping issues
Antoine Tenart [Mon, 24 Sep 2018 14:56:13 +0000 (16:56 +0200)]
net: mvneta: fix the remaining Rx descriptor unmapping issues

With CONFIG_DMA_API_DEBUG enabled we get DMA unmapping warning in
various places of the mvneta driver, for example when putting down an
interface while traffic is passing through.

The issue is when using s/w buffer management, the Rx buffers are mapped
using dma_map_page but unmapped with dma_unmap_single. This patch fixes
this by using the right unmapping function.

Fixes: 0db0df595cbd ("net: mvneta: Improve the buffer allocation method for SWBM")
Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Reviewed-by: Gregory CLEMENT <gregory.clement@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoip_tunnel: be careful when accessing the inner header
Paolo Abeni [Mon, 24 Sep 2018 13:48:19 +0000 (15:48 +0200)]
ip_tunnel: be careful when accessing the inner header

Cong noted that we need the same checks introduced by commit a4c49f991d44
("ip6_tunnel: be careful when accessing the inner header")
even for ipv4 tunnels.

Fixes: e7ccb3806ba4 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: qca_spi: Introduce write register verification
Stefan Wahren [Mon, 24 Sep 2018 11:20:10 +0000 (13:20 +0200)]
net: qca_spi: Introduce write register verification

The SPI protocol for the QCA7000 doesn't have any fault detection.
In order to increase the drivers reliability in noisy environments,
we could implement a write verification inspired by the enc28j60.
This should avoid situations were the driver wrongly assumes the
receive interrupt is enabled and miss all incoming packets.

This function is disabled per default and can be controlled via module
parameter wr_verify.

Signed-off-by: Michael Heimpold <michael.heimpold@i2se.com>
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotls: Fixed uninitialised vars warning
Vakul Garg [Mon, 24 Sep 2018 10:39:49 +0000 (16:09 +0530)]
tls: Fixed uninitialised vars warning

In tls_sw_sendmsg() and tls_sw_sendpage(), it is possible that the
uninitialised variable 'ret' gets passed to sk_stream_error(). So
initialise local variable 'ret' to '0. The warnings were detected by
'smatch' tool.

Fixes: b574142da9d5 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/tls: Fixed race condition in async encryption
Vakul Garg [Mon, 24 Sep 2018 10:05:56 +0000 (15:35 +0530)]
net/tls: Fixed race condition in async encryption

On processors with multi-engine crypto accelerators, it is possible that
multiple records get encrypted in parallel and their encryption
completion is notified to different cpus in multicore processor. This
leads to the situation where tls_encrypt_done() starts executing in
parallel on different cores. In current implementation, encrypted
records are queued to tx_ready_list in tls_encrypt_done(). This requires
addition to linked list 'tx_ready_list' to be protected. As
tls_decrypt_done() could be executing in irq content, it is not possible
to protect linked list addition operation using a lock.

To fix the problem, we remove linked list addition operation from the
irq context. We do tx_ready_list addition/removal operation from
application context only and get rid of possible multiple access to
the linked list. Before starting encryption on the record, we add it to
the tail of tx_ready_list. To prevent tls_tx_records() from transmitting
it, we mark the record with a new flag 'tx_ready' in 'struct tls_rec'.
When record encryption gets completed, tls_encrypt_done() has to only
update the 'tx_ready' flag to true & linked list add operation is not
required.

The changed logic brings some other side benefits. Since the records
are always submitted in tls sequence number order for encryption, the
tx_ready_list always remains sorted and addition of new records to it
does not have to traverse the linked list.

Lastly, we renamed tx_ready_list in 'struct tls_sw_context_tx' to
'tx_list'. This is because now, the some of the records at the tail are
not ready to transmit.

Fixes: b574142da9d5 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'few-NTF_ROUTER-related-updates'
David S. Miller [Mon, 24 Sep 2018 19:21:33 +0000 (12:21 -0700)]
Merge branch 'few-NTF_ROUTER-related-updates'

Roopa Prabhu says:

====================
few NTF_ROUTER related updates

This series allows setting of NTF_ROUTER by an external
entity (eg BGP E-VPN control plane). Also fixes missing
netlink notification on neigh NTF_ROUTER flag changes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoneighbour: send netlink notification if NTF_ROUTER changes
Roopa Prabhu [Sun, 23 Sep 2018 04:26:20 +0000 (21:26 -0700)]
neighbour: send netlink notification if NTF_ROUTER changes

send netlink notification if neigh_update results in NTF_ROUTER
change and if NEIGH_UPDATE_F_ISROUTER is on. Also move the
NTF_ROUTER change function into a helper.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoneighbour: allow admin to set NTF_ROUTER
Roopa Prabhu [Sun, 23 Sep 2018 04:26:19 +0000 (21:26 -0700)]
neighbour: allow admin to set NTF_ROUTER

This patch allows admin setting of NTF_ROUTER flag
on a neighbour entry. This enables external control
plane (like bgp evpn) to manage neigh entries with
NTF_ROUTER flag.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agompls: allow routes on ip6gre devices
Saif Hasan [Fri, 21 Sep 2018 21:30:05 +0000 (14:30 -0700)]
mpls: allow routes on ip6gre devices

Summary:

This appears to be necessary and sufficient change to enable `MPLS` on
`ip6gre` tunnels (RFC4023).

This diff allows IP6GRE devices to be recognized by MPLS kernel module
and hence user can configure interface to accept packets with mpls
headers as well setup mpls routes on them.

Test Plan:

Test plan consists of multiple containers connected via GRE-V6 tunnel.
Then carrying out testing steps as below.

- Carry out necessary sysctl settings on all containers

```
sysctl -w net.mpls.platform_labels=65536
sysctl -w net.mpls.ip_ttl_propagate=1
sysctl -w net.mpls.conf.lo.input=1
```

- Establish IP6GRE tunnels

```
ip -6 tunnel add name if_1_2_1 mode ip6gre \
  local 2401:db00:21:6048:feed:0::1 \
  remote 2401:db00:21:6048:feed:0::2 key 1
ip link set dev if_1_2_1 up
sysctl -w net.mpls.conf.if_1_2_1.input=1
ip -4 addr add 169.254.0.2/31 dev if_1_2_1 scope link

ip -6 tunnel add name if_1_3_1 mode ip6gre \
  local 2401:db00:21:6048:feed:0::1 \
  remote 2401:db00:21:6048:feed:0::3 key 1
ip link set dev if_1_3_1 up
sysctl -w net.mpls.conf.if_1_3_1.input=1
ip -4 addr add 169.254.0.4/31 dev if_1_3_1 scope link
```

- Install MPLS encap rules on node-1 towards node-2

```
ip route add 192.168.0.11/32 nexthop encap mpls 32/64 \
  via inet 169.254.0.3 dev if_1_2_1
```

- Install MPLS forwarding rules on node-2 and node-3
```
// node2
ip -f mpls route add 32 via inet 169.254.0.7 dev if_2_4_1

// node3
ip -f mpls route add 64 via inet 169.254.0.12 dev if_4_3_1
```

- Ping 192.168.0.11 (node4) from 192.168.0.1 (node1) (where routing
  towards 192.168.0.1 is via IP route directly towards node1 from node4)
```
ping 192.168.0.11
```

- tcpdump on interface to capture ping packets wrapped within MPLS
  header which inturn wrapped within IP6GRE header

```
16:43:41.121073 IP6
  2401:db00:21:6048:feed::1 > 2401:db00:21:6048:feed::2:
  DSTOPT GREv0, key=0x1, length 100:
  MPLS (label 32, exp 0, ttl 255) (label 64, exp 0, [S], ttl 255)
  IP 192.168.0.1 > 192.168.0.11:
  ICMP echo request, id 1208, seq 45, length 64

0x0000:  6000 2cdb 006c 3c3f 2401 db00 0021 6048  `.,..l<?$....!`H
0x0010:  feed 0000 0000 0001 2401 db00 0021 6048  ........$....!`H
0x0020:  feed 0000 0000 0002 2f00 0401 0401 0100  ......../.......
0x0030:  2000 8847 0000 0001 0002 00ff 0004 01ff  ...G............
0x0040:  4500 0054 3280 4000 ff01 c7cb c0a8 0001  E..T2.@.........
0x0050:  c0a8 000b 0800 a8d7 04b8 002d 2d3c a05b  ...........--<.[
0x0060:  0000 0000 bcd8 0100 0000 0000 1011 1213  ................
0x0070:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
0x0080:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
0x0090:  3435 3637                                4567
```

Signed-off-by: Saif Hasan <has@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions'
David S. Miller [Mon, 24 Sep 2018 19:18:43 +0000 (12:18 -0700)]
Merge branch 'net-sched-Add-hardware-specific-counters-to-TC-actions'

Eelco Chaudron says:

====================
net/sched: Add hardware specific counters to TC actions

Add hardware specific counters to TC actions which will be exported
through the netlink API. This makes troubleshooting TC flower offload
easier, as it possible to differentiate the packets being offloaded.

v2 - Rebased on latest net-next
====================

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/sched: Add hardware specific counters to TC actions
Eelco Chaudron [Fri, 21 Sep 2018 11:14:02 +0000 (07:14 -0400)]
net/sched: Add hardware specific counters to TC actions

Add additional counters that will store the bytes/packets processed by
hardware. These will be exported through the netlink interface for
displaying by the iproute2 tc tool

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/core: Add new basic hardware counter
Eelco Chaudron [Fri, 21 Sep 2018 11:13:54 +0000 (07:13 -0400)]
net/core: Add new basic hardware counter

Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can
be used to count packets/bytes processed by hardware offload.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
David S. Miller [Mon, 24 Sep 2018 17:08:45 +0000 (10:08 -0700)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Daniel Borkmann says:

====================
pull-request: bpf 2018-09-24

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Several fixes for BPF sockmap to only allow sockets being attached in
   ESTABLISHED state, from John.

2) Fix up the license to LGPL/BSD for the libc compat header which contains
   fallback helpers that libbpf and bpftool is using, from Jakub.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'mvpp2-Add-txq-to-CPU-mapping'
David S. Miller [Mon, 24 Sep 2018 17:01:10 +0000 (10:01 -0700)]
Merge branch 'mvpp2-Add-txq-to-CPU-mapping'

Maxime Chevallier says:

====================
net: mvpp2: Add txq to CPU mapping

This short series adds XPS support to the mvpp2 driver, by mapping
txqs and CPUs. This comes with a patch using round-robin scheduling
for the HW to pick the next txq to transmit from, instead of the default
fixed-priority scheduling.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mvpp2: use round-robin scheduling for TX queues on the same CPU
Maxime Chevallier [Mon, 24 Sep 2018 09:11:06 +0000 (11:11 +0200)]
net: mvpp2: use round-robin scheduling for TX queues on the same CPU

This commit allows each TXQ to be picked in a round-robin fashion by
the PPv2 transmit scheduling mechanism. This is opposed to the default
behaviour that prioritizes the highest numbered queues.

Suggested-by: Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: mvpp2: support XPS by mapping TX queues to CPUs
Maxime Chevallier [Mon, 24 Sep 2018 09:11:05 +0000 (11:11 +0200)]
net: mvpp2: support XPS by mapping TX queues to CPUs

Since the PPv2 controller has multiple TX queues, we can spread traffic
by assining TX queues to CPUs, allowing to use XPS to balance egress
traffic between CPUs.

Suggested-by : Yan Markman <ymarkman@marvell.com>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge tag 'media/v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab...
Greg Kroah-Hartman [Mon, 24 Sep 2018 13:16:41 +0000 (15:16 +0200)]
Merge tag 'media/v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media

Mauro briefly writes:
  "media fixes for v4.19-rc5

   some drivers and Kbuild fixes"

* tag 'media/v4.19-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
  media: platform: fix cros-ec-cec build error
  media: staging/media/mt9t031/Kconfig: remove bogus entry
  media: i2c: mt9v111: Fix v4l2-ctrl error handling
  media: camss: add missing includes
  media: camss: Use managed memory allocations
  media: camss: mark PM functions as __maybe_unused
  media: af9035: prevent buffer overflow on write
  media: video_function_calls.rst: drop obsolete video-set-attributes reference

6 years agonet: aquantia: memory corruption on jumbo frames
Friedemann Gerold [Sat, 15 Sep 2018 15:03:39 +0000 (18:03 +0300)]
net: aquantia: memory corruption on jumbo frames

This patch fixes skb_shared area, which will be corrupted
upon reception of 4K jumbo packets.

Originally build_skb usage purpose was to reuse page for skb to eliminate
needs of extra fragments. But that logic does not take into account that
skb_shared_info should be reserved at the end of skb data area.

In case packet data consumes all the page (4K), skb_shinfo location
overflows the page. As a consequence, __build_skb zeroed shinfo data above
the allocated page, corrupting next page.

The issue is rarely seen in real life because jumbo are normally larger
than 4K and that causes another code path to trigger.
But it 100% reproducible with simple scapy packet, like:

    sendp(IP(dst="192.168.100.3") / TCP(dport=443) \
          / Raw(RandString(size=(4096-40))), iface="enp1s0")

Fixes: 3bc2c65a1162 ("net: ethernet: aquantia: Add ring support code")
Reported-by: Friedemann Gerold <f.gerold@b-c-s.de>
Reported-by: Michael Rauch <michael@rauch.be>
Signed-off-by: Friedemann Gerold <f.gerold@b-c-s.de>
Tested-by: Nikita Danilov <nikita.danilov@aquantia.com>
Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'netpoll-avoid-capture-effects-for-NAPI-drivers'
David S. Miller [Mon, 24 Sep 2018 04:55:26 +0000 (21:55 -0700)]
Merge branch 'netpoll-avoid-capture-effects-for-NAPI-drivers'

Eric Dumazet says:

====================
netpoll: avoid capture effects for NAPI drivers

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC).

This capture, showing one ksoftirqd eating all cycles
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

It seems that all networking drivers that do use NAPI
for their TX completions, should not provide a ndo_poll_controller() :

Most NAPI drivers have netpoll support already handled
in core networking stack, since netpoll_poll_dev()
uses poll_napi(dev) to iterate through registered
NAPI contexts for a device.

This patch series take care of the first round, we will
handle other drivers in future rounds.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotun: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:52 +0000 (15:27 -0700)]
tun: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

tun uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonfp: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:51 +0000 (15:27 -0700)]
nfp: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

nfp uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Tested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobnxt: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:50 +0000 (15:27 -0700)]
bnxt: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

bnxt uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobnx2x: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:49 +0000 (15:27 -0700)]
bnx2x: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

bnx2x uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <ariel.elior@cavium.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlx5: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:48 +0000 (15:27 -0700)]
mlx5: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

mlx5 uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlx4: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:47 +0000 (15:27 -0700)]
mlx4: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

mlx4 uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoi40evf: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:46 +0000 (15:27 -0700)]
i40evf: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

i40evf uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoice: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:45 +0000 (15:27 -0700)]
ice: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

ice uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoigb: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:44 +0000 (15:27 -0700)]
igb: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

igb uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoixgb: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:43 +0000 (15:27 -0700)]
ixgb: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

ixgb uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

This also removes a problematic use of disable_irq() in
a context it is forbidden, as explained in commit
3d2db4198936 ("8139too: Use disable_irq_nosync() in
rtl8139_poll_controller()")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agofm10k: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:42 +0000 (15:27 -0700)]
fm10k: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
lasts for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

fm10k uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoixgbevf: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:41 +0000 (15:27 -0700)]
ixgbevf: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

ixgbevf uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoixgbe: remove ndo_poll_controller
Eric Dumazet [Fri, 21 Sep 2018 22:27:40 +0000 (15:27 -0700)]
ixgbe: remove ndo_poll_controller

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

ixgbe uses NAPI for TX completions, so we better let core
networking stack call the napi->poll() to avoid the capture.

Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Song Liu <songliubraving@fb.com>
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agobonding: use netpoll_poll_dev() helper
Eric Dumazet [Fri, 21 Sep 2018 22:27:39 +0000 (15:27 -0700)]
bonding: use netpoll_poll_dev() helper

We want to allow NAPI drivers to no longer provide
ndo_poll_controller() method, as it has been proven problematic.

team driver must not look at its presence, but instead call
netpoll_poll_dev() which factorize the needed actions.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonetpoll: make ndo_poll_controller() optional
Eric Dumazet [Fri, 21 Sep 2018 22:27:38 +0000 (15:27 -0700)]
netpoll: make ndo_poll_controller() optional

As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

It seems that all networking drivers that do use NAPI
for their TX completions, should not provide a ndo_poll_controller().

NAPI drivers have netpoll support already handled
in core networking stack, since netpoll_poll_dev()
uses poll_napi(dev) to iterate through registered
NAPI contexts for a device.

This patch allows netpoll_poll_dev() to process NAPI
contexts even for drivers not providing ndo_poll_controller(),
allowing for following patches in NAPI drivers.

Also we export netpoll_poll_dev() so that it can be called
by bonding/team drivers in following patches.

Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agomlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement
Petr Machata [Sun, 23 Sep 2018 14:48:55 +0000 (17:48 +0300)]
mlxsw: Make MLXSW_SP1_FWREV_MINOR a hard requirement

Up until now, mlxsw tolerated firmware versions that weren't exactly
matching the required version, if the branch number matched. That
allowed the users to test various firmware versions as long as they were
on the right branch.

On the other hand, it made it impossible for mlxsw to put a hard lower
bound on a version that fixes all problems known to date. If a user had
a somewhat older FW version installed, mlxsw would start up just fine,
possibly performing non-optimally as it would use features that trigger
problematic behavior.

Therefore tweak the check to accept any FW version that is:

- on the same branch as the preferred version, and
- the same as or newer than the preferred version.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agords: Fix build regression.
David S. Miller [Sun, 23 Sep 2018 19:25:15 +0000 (12:25 -0700)]
rds: Fix build regression.

Use DECLARE_* not DEFINE_*

Fixes: cc0252f4d4eb ("RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats")
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoLinux 4.19-rc5
Greg Kroah-Hartman [Sun, 23 Sep 2018 17:15:18 +0000 (19:15 +0200)]
Linux 4.19-rc5

6 years agoMerge tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Greg Kroah-Hartman [Sun, 23 Sep 2018 15:19:27 +0000 (17:19 +0200)]
Merge tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd

Lee writes:
  "MFD fixes for v4.19
   - Fix Dialog DA9063 regulator constraints issue causing failure in
     probe
   - Fix OMAP Device Tree compatible strings to match DT"

* tag 'mfd-fixes-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd:
  mfd: omap-usb-host: Fix dts probe of children
  mfd: da9063: Fix DT probing with constraints

6 years agoMerge tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Sun, 23 Sep 2018 11:32:19 +0000 (13:32 +0200)]
Merge tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip

Juergen writes:
  "xen:
   Two small fixes for xen drivers."

* tag 'for-linus-4.19d-rc5-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
  xen: issue warning message when out of grant maptrack entries
  xen/x86/vpmu: Zero struct pt_regs before calling into sample handling code

6 years agoMerge tag 'for-linus-20180922' of git://git.kernel.dk/linux-block
Greg Kroah-Hartman [Sun, 23 Sep 2018 06:33:28 +0000 (08:33 +0200)]
Merge tag 'for-linus-20180922' of git://git.kernel.dk/linux-block

Jens writes:
  "Just a single fix in this pull request, fixing a regression in
  /proc/diskstats caused by the unification of timestamps."

* tag 'for-linus-20180922' of git://git.kernel.dk/linux-block:
  block: use nanosecond resolution for iostat

6 years agoMerge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Sun, 23 Sep 2018 06:10:12 +0000 (08:10 +0200)]
Merge branch 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Thomas writes:
  "A set of fixes for x86:

   - Resolve the kvmclock regression on AMD systems with memory
     encryption enabled. The rework of the kvmclock memory allocation
     during early boot results in encrypted storage, which is not
     shareable with the hypervisor. Create a new section for this data
     which is mapped unencrypted and take care that the later
     allocations for shared kvmclock memory is unencrypted as well.

   - Fix the build regression in the paravirt code introduced by the
     recent spectre v2 updates.

   - Ensure that the initial static page tables cover the fixmap space
     correctly so early console always works. This worked so far by
     chance, but recent modifications to the fixmap layout can -
     depending on kernel configuration - move the relevant entries to a
     different place which is not covered by the initial static page
     tables.

   - Address the regressions and issues which got introduced with the
     recent extensions to the Intel Recource Director Technology code.

   - Update maintainer entries to document reality"

* 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  x86/mm: Expand static page table for fixmap space
  MAINTAINERS: Add X86 MM entry
  x86/intel_rdt: Add Reinette as co-maintainer for RDT
  MAINTAINERS: Add Borislav to the x86 maintainers
  x86/paravirt: Fix some warning messages
  x86/intel_rdt: Fix incorrect loop end condition
  x86/intel_rdt: Fix exclusive mode handling of MBA resource
  x86/intel_rdt: Fix incorrect loop end condition
  x86/intel_rdt: Do not allow pseudo-locking of MBA resource
  x86/intel_rdt: Fix unchecked MSR access
  x86/intel_rdt: Fix invalid mode warning when multiple resources are managed
  x86/intel_rdt: Global closid helper to support future fixes
  x86/intel_rdt: Fix size reporting of MBA resource
  x86/intel_rdt: Fix data type in parsing callbacks
  x86/kvm: Use __bss_decrypted attribute in shared variables
  x86/mm: Add .bss..decrypted section to hold shared variables

6 years agoMerge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Sun, 23 Sep 2018 06:09:16 +0000 (08:09 +0200)]
Merge branch 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Thomas writes:
  "- Provide a strerror_r wrapper so lib/bpf can be built on systems
     without _GNU_SOURCE
   - Unbreak the man page generator when building out of tree"

* 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  perf Documentation: Fix out-of-tree asciidoctor man page generation
  tools lib bpf: Provide wrapper for strerror_r to build in !_GNU_SOURCE systems

6 years agoMerge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel...
Greg Kroah-Hartman [Sun, 23 Sep 2018 06:06:54 +0000 (08:06 +0200)]
Merge branch 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Thomas writes:
  "Make the EFI arm stub device tree loader default on to unbreak
  existing EFI boot loaders which do not have DTB support."

* 'efi-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  efi/libstub/arm: default EFI_ARMSTUB_DTB_LOADER to y

6 years agoMerge branch 'hv_netvsc-Support-LRO-RSC-in-the-vSwitch'
David S. Miller [Sun, 23 Sep 2018 00:23:16 +0000 (17:23 -0700)]
Merge branch 'hv_netvsc-Support-LRO-RSC-in-the-vSwitch'

Haiyang Zhang says:

====================
hv_netvsc: Support LRO/RSC in the vSwitch

The patch adds support for LRO/RSC in the vSwitch feature. It reduces
the per packet processing overhead by coalescing multiple TCP segments
when possible. The feature is enabled by default on VMs running on
Windows Server 2019 and later.

The patch set also adds ethtool command handler and documents.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohv_netvsc: Update document for LRO/RSC support
Haiyang Zhang [Fri, 21 Sep 2018 18:20:37 +0000 (18:20 +0000)]
hv_netvsc: Update document for LRO/RSC support

Update document for LRO/RSC support, and the command line info to
change the setting.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohv_netvsc: Add handler for LRO setting change
Haiyang Zhang [Fri, 21 Sep 2018 18:20:36 +0000 (18:20 +0000)]
hv_netvsc: Add handler for LRO setting change

This patch adds the handler for LRO setting change, so that a user
can use ethtool command to enable / disable LRO feature.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agohv_netvsc: Add support for LRO/RSC in the vSwitch
Haiyang Zhang [Fri, 21 Sep 2018 18:20:35 +0000 (18:20 +0000)]
hv_netvsc: Add support for LRO/RSC in the vSwitch

LRO/RSC in the vSwitch is a feature available in Windows Server 2019
hosts and later. It reduces the per packet processing overhead by
coalescing multiple TCP segments when possible. This patch adds netvsc
driver support for this feature.

Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet-ethtool: ETHTOOL_GUFO did not and should not require CAP_NET_ADMIN
Maciej Żenczykowski [Sat, 22 Sep 2018 08:34:01 +0000 (01:34 -0700)]
net-ethtool: ETHTOOL_GUFO did not and should not require CAP_NET_ADMIN

So it should not fail with EPERM even though it is no longer implemented...

This is a fix for:
  (userns)$ egrep ^Cap /proc/self/status
  CapInh: 0000003fffffffff
  CapPrm: 0000003fffffffff
  CapEff: 0000003fffffffff
  CapBnd: 0000003fffffffff
  CapAmb: 0000003fffffffff

  (userns)$ tcpdump -i usb_rndis0
  tcpdump: WARNING: usb_rndis0: SIOCETHTOOL(ETHTOOL_GUFO) ioctl failed: Operation not permitted
  Warning: Kernel filter failed: Bad file descriptor
  tcpdump: can't remove kernel filter: Bad file descriptor

With this change it returns EOPNOTSUPP instead of EPERM.

See also https://github.com/the-tcpdump-group/libpcap/issues/689

Fixes: 36a9f8e57326 "net: Remove references to NETIF_F_UFO from ethtool."
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'net-dsa-b53-SGMII-modes-fixes'
David S. Miller [Sat, 22 Sep 2018 03:01:20 +0000 (20:01 -0700)]
Merge branch 'net-dsa-b53-SGMII-modes-fixes'

Florian Fainelli says:

====================
net: dsa: b53: SGMII modes fixes

Here are two additional fixes that are required in order for SGMII to
work correctly. This was discovered with using a copper SFP which would
make us use SGMII mode, we would actually leave the HW configured in its
default mode: Fiber.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: b53: Also include SGMII for mac_config and mac_link_state
Florian Fainelli [Fri, 21 Sep 2018 23:43:59 +0000 (16:43 -0700)]
net: dsa: b53: Also include SGMII for mac_config and mac_link_state

In both 802.3z and SGMII modes we need to configure the MAC accordingly
to flip between Fiber and SGMII modes, and we need to read the MAC
status from the SGMII in-band control word.

Fixes: faa543b7c51d ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: b53: Fix B53_SERDES_DIGITAL_CONTROL offset
Florian Fainelli [Fri, 21 Sep 2018 23:43:58 +0000 (16:43 -0700)]
net: dsa: b53: Fix B53_SERDES_DIGITAL_CONTROL offset

Maths went wrong, to get 0x20, we need to do 0x1e + (x) * 2, not 0x18,
fix that offset so we access the correct registers. This would make us
not access the correct SerDes Digital control words, status would be
fine and so we would not be correctly flipping between Fiber and SGMII
modes resulting in incorrect status words being pulled into the SerDes
digital status register.

Fixes: faa543b7c51d ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: dsa: b53: Don't assign autonegotiation enabled
Florian Fainelli [Fri, 21 Sep 2018 22:30:05 +0000 (15:30 -0700)]
net: dsa: b53: Don't assign autonegotiation enabled

PHYLINK takes care of filing the right information into
state->an_enabled, get rid of the read from the SerDes's BMCR register.

Fixes: faa543b7c51d ("net: dsa: b53: Add SerDes support")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agodecnet: Remove unnecessary check for dev->name
Nathan Chancellor [Fri, 21 Sep 2018 19:30:34 +0000 (12:30 -0700)]
decnet: Remove unnecessary check for dev->name

Clang warns that the address of a pointer will always evaluated as true
in a boolean context.

net/decnet/dn_dev.c:1366:10: warning: address of array 'dev->name' will
always evaluate to 'true' [-Wpointer-bool-conversion]
                                dev->name ? dev->name : "???",
                                ~~~~~^~~~ ~
1 warning generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/116
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoselftests/net: add ipv6 tests to ip_defrag selftest
Peter Oskolkov [Fri, 21 Sep 2018 18:17:17 +0000 (11:17 -0700)]
selftests/net: add ipv6 tests to ip_defrag selftest

This patch adds ipv6 defragmentation tests to ip_defrag selftest,
to complement existing ipv4 tests.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net
Peter Oskolkov [Fri, 21 Sep 2018 18:17:16 +0000 (11:17 -0700)]
net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net

Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.

There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:

- a security/ddos protection policy may lower the thresholds in the
  root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
  thresholds higher in its namespace than what is in the root/init ns

The new behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 9000000

The old behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 4194304

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 4194304

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoipv6: discard IP frag queue on more errors
Peter Oskolkov [Fri, 21 Sep 2018 18:17:15 +0000 (11:17 -0700)]
ipv6: discard IP frag queue on more errors

This is similar to how ipv4 now behaves:
commit ea1fcf805697 ("ip: fail fast on IP defrag errors").

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoRDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats
Nathan Chancellor [Fri, 21 Sep 2018 18:04:51 +0000 (11:04 -0700)]
RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats

Clang warns when two declarations' section attributes don't match.

net/rds/ib_stats.c:40:1: warning: section does not match previous
declaration [-Wsection]
DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_ib_statistics, rds_ib_stats);
^
./include/linux/percpu-defs.h:142:2: note: expanded from macro
'DEFINE_PER_CPU_SHARED_ALIGNED'
        DEFINE_PER_CPU_SECTION(type, name,
PER_CPU_SHARED_ALIGNED_SECTION) \
        ^
./include/linux/percpu-defs.h:93:9: note: expanded from macro
'DEFINE_PER_CPU_SECTION'
        extern __PCPU_ATTRS(sec) __typeof__(type) name;
\
               ^
./include/linux/percpu-defs.h:49:26: note: expanded from macro
'__PCPU_ATTRS'
        __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))
\
                                ^
net/rds/ib.h:446:1: note: previous attribute is here
DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
^
./include/linux/percpu-defs.h:111:2: note: expanded from macro
'DECLARE_PER_CPU'
        DECLARE_PER_CPU_SECTION(type, name, "")
        ^
./include/linux/percpu-defs.h:87:9: note: expanded from macro
'DECLARE_PER_CPU_SECTION'
        extern __PCPU_ATTRS(sec) __typeof__(type) name
               ^
./include/linux/percpu-defs.h:49:26: note: expanded from macro
'__PCPU_ATTRS'
        __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))
\
                                ^
1 warning generated.

The initial definition was added in commit be0bfd7201cf ("RDS/IB:
Infiniband transport") and the cache aligned definition was added in
commit b602a94c2aa9 ("RDS/IB: Stats and sysctls") right after. The
definition probably should have been updated in net/rds/ib.h, which is
what this patch does.

Link: https://github.com/ClangBuiltLinux/linux/issues/114
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet/ipv4: avoid compile error in fib_info_nh_uses_dev
Eric Dumazet [Fri, 21 Sep 2018 17:58:07 +0000 (10:58 -0700)]
net/ipv4: avoid compile error in fib_info_nh_uses_dev

net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
cc1: all warnings being treated as errors

Fixes: 2857e613360d ("net/ipv4: Move device validation to helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'tcp-switch-to-Early-Departure-Time-model'
David S. Miller [Sat, 22 Sep 2018 02:38:00 +0000 (19:38 -0700)]
Merge branch 'tcp-switch-to-Early-Departure-Time-model'

Eric Dumazet says:

====================
tcp: switch to Early Departure Time model

In the early days, pacing has been implemented in sch_fq (FQ)
in a generic way :

- SO_MAX_PACING_RATE could be used by any sockets.

- TCP would vary effective pacing rate based on CWND*MSS/SRTT

- FQ would ensure delays between packets based on current
  sk->sk_pacing_rate, but with some quantum based artifacts.
  (inflating RPC tail latencies)

- BBR then tweaked the pacing rate in its various phases
  (PROBE, DRAIN, ...)

This worked reasonably well, but had the side effect that TCP RTT
samples would be inflated by the sojourn time of the packets in FQ.

Also note that when FQ is not used and TCP wants pacing, the
internal pacing fallback has very different behavior, since TCP
emits packets at the time they should be sent (with unreasonable
assumptions about scheduling costs)

Van Jacobson gave a talk at Netdev 0x12 in Montreal, about letting
TCP (or applications for UDP messages) decide of the Earliest
Departure Time, instead of letting packet schedulers derive it
from pacing rate.

https://www.netdevconf.org/0x12/session.html?evolving-from-afap-teaching-nics-about-time
https://www.files.netdevconf.org/d/46def75c2ef345809bbe/files/?p=/Evolving%20from%20AFAP%20%E2%80%93%20Teaching%20NICs%20about%20time.pdf

Recent additions in linux provided SO_TXTIME and a new ETF qdisc
supporting the new skb->tstamp role

This patch series converts TCP and FQ to the same model.

This might in the future allow us to relax tight TSQ limits
(if FQ is present in the output path), and thus lower
number of callbacks to tcp_write_xmit(), thanks to batching.

This will be followed by FQ change allowing SO_TXTIME support
so that QUIC servers can let the pacing being done in FQ (or
offloaded if network device permits)

For example, a TCP flow rated at 24Mbps now shows a more meaningful RTT

Before :

ESTAB  0  211408 10.246.7.151:41558   10.246.7.152:33723
 cubic wscale:8,8 rto:203 rtt:2.195/0.084 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:36897937
  segs_out:25488 segs_in:12454 data_segs_out:25486
  send 105.5Mbps lastsnd:1 lastrcv:12851 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 22.9Mbps
  busy:12851ms unacked:4 rcv_space:29200 notsent:205616 minrtt:0.026

After :

ESTAB  0  192584 10.246.7.151:61612   10.246.7.152:34375
 cubic wscale:8,8 rto:201 rtt:0.165/0.129 mss:1448 rcvmss:536
  advmss:1448 cwnd:20 ssthresh:20 bytes_acked:170755401
  segs_out:117931 segs_in:57651 data_segs_out:117929
  send 1404.1Mbps lastsnd:1 lastrcv:56915 lastack:1
  pacing_rate 24.0Mbps/24.0Mbps delivery_rate 24.2Mbps
  busy:56915ms unacked:4 rcv_space:29200 notsent:186792 minrtt:0.054

A nice side effect of this patch series is a reduction of max/p99
latencies of RPC workloads, since the FQ quantum no longer adds
artifact.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet_sched: sch_fq: remove dead code dealing with retransmits
Eric Dumazet [Fri, 21 Sep 2018 15:51:54 +0000 (08:51 -0700)]
net_sched: sch_fq: remove dead code dealing with retransmits

With the earliest departure time model, we no longer plan
special casing TCP retransmits. We therefore remove dead
code (since most compilers understood skb_is_retransmit()
was false)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: switch tcp_internal_pacing() to tcp_wstamp_ns
Eric Dumazet [Fri, 21 Sep 2018 15:51:53 +0000 (08:51 -0700)]
tcp: switch tcp_internal_pacing() to tcp_wstamp_ns

Now TCP keeps track of tcp_wstamp_ns, recording the earliest
departure time of next packet, we can remove duplicate code
from tcp_internal_pacing()

This removes one ktime_get_tai_ns() call, and a divide.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: switch tcp and sch_fq to new earliest departure time model
Eric Dumazet [Fri, 21 Sep 2018 15:51:52 +0000 (08:51 -0700)]
tcp: switch tcp and sch_fq to new earliest departure time model

TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.

This has the nice effect of removing some delays caused by FQ
quantum mechanism, causing inflated max/P99 latencies.

Also we might relax TCP Small Queue tight limits in the future,
since this new model allow TCP to build bigger batches, since
sch_fq (or a device with earliest departure time offload) ensure
these packets will be delivered on time.

Note that other protocols are not converted (they will probably
never be) so sch_fq has still support for SO_MAX_PACING_RATE

Tested:

Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.

The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwin after few losses, or/and rtt above few ms)

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
 Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: switch internal pacing timer to CLOCK_TAI
Eric Dumazet [Fri, 21 Sep 2018 15:51:51 +0000 (08:51 -0700)]
tcp: switch internal pacing timer to CLOCK_TAI

Next patch will use tcp_wstamp_ns to feed internal
TCP pacing timer, so switch to CLOCK_TAI to share same base.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: provide earliest departure time in skb->tstamp
Eric Dumazet [Fri, 21 Sep 2018 15:51:50 +0000 (08:51 -0700)]
tcp: provide earliest departure time in skb->tstamp

Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.

Do not clear skb->tstamp before entering IP stacks in TX,
so that qdisc or devices can implement pacing based on the
earliest departure time instead of socket sk->sk_pacing_rate

Packets are fed with tcp_wstamp_ns, and following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: add tcp_wstamp_ns socket field
Eric Dumazet [Fri, 21 Sep 2018 15:51:49 +0000 (08:51 -0700)]
tcp: add tcp_wstamp_ns socket field

TCP will soon provide earliest departure time on TX skbs.
It needs to track this in a new variable.

tcp_mstamp_refresh() needs to update this variable, and
became too big to stay an inline.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet_sched: sch_fq: switch to CLOCK_TAI
Eric Dumazet [Fri, 21 Sep 2018 15:51:48 +0000 (08:51 -0700)]
net_sched: sch_fq: switch to CLOCK_TAI

TCP will soon provide per skb->tstamp with earliest departure time,
so that sch_fq does not have to determine departure time by looking
at socket sk_pacing_rate.

We chose in linux-4.19 CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: introduce tcp_skb_timestamp_us() helper
Eric Dumazet [Fri, 21 Sep 2018 15:51:47 +0000 (08:51 -0700)]
tcp: introduce tcp_skb_timestamp_us() helper

There are few places where TCP reads skb->skb_mstamp expecting
a value in usec unit.

skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value.

Add tcp_skb_timestamp_us() to provide proper conversion when needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agotcp: switch tcp_clock_ns() to CLOCK_TAI base
Eric Dumazet [Fri, 21 Sep 2018 15:51:46 +0000 (08:51 -0700)]
tcp: switch tcp_clock_ns() to CLOCK_TAI base

TCP pacing is either implemented in sch_fq or internally.
We have the goal of being able to offload pacing on the NICS.

TCP will soon provide per skb skb->tstamp as early departure time.

Like ETF in commit f37467d6ed91 ("net/sched: Introduce the ETF Qdisc")
we chose CLOCK_T as the clock base, so that TCP and pacers can share
a common clock, to get better RTT samples (without pacing artificially
inflating these samples).

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agoMerge branch 'hns3-next'
David S. Miller [Sat, 22 Sep 2018 02:29:33 +0000 (19:29 -0700)]
Merge branch 'hns3-next'

Salil Mehta says:

====================
Bug fixes, snall modifications & cleanup for HNS3 driver

This patch presents some bug fixes, small modifications and cleanups
to the HNS3 VF and PF driver.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Remove redundant hclge_get_port_type()
Peng Li [Fri, 21 Sep 2018 15:41:48 +0000 (16:41 +0100)]
net: hns3: Remove redundant hclge_get_port_type()

This patch removes hclge_get_port_type which is redundant.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix speed/duplex information loss problem when executing ethtool ethx...
Fuyun Liang [Fri, 21 Sep 2018 15:41:47 +0000 (16:41 +0100)]
net: hns3: Fix speed/duplex information loss problem when executing ethtool ethx cmd of VF

Our VF has not implemented the ops for get_port_type. So when we executing
ethtool ethx cmd of VF, hns3_get_link_ksettings will return directly. And
we can not query anything.

To support get_link_ksettings for VF, this patch replaces get_port_type
with get_media_type. If the media type is HNAE3_MEDIA_TYPE_NONE,
hns3_get_link_ksettings will return link information of VF.

Fixes: 05afe79cba93 ("net: hns3: Refine hns3_get_link_ksettings()")
Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Add get_media_type ops support for VF
Peng Li [Fri, 21 Sep 2018 15:41:46 +0000 (16:41 +0100)]
net: hns3: Add get_media_type ops support for VF

This patch adds the ops of get_media_type support for VF.

Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Remove print messages for error packet
Jian Shen [Fri, 21 Sep 2018 15:41:45 +0000 (16:41 +0100)]
net: hns3: Remove print messages for error packet

There are already multiple types packets statistics for error packets,
it's unnecessary to print them, which may affect the rx performance if
print too many.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Add unlikely for dma_mapping_error check
Jian Shen [Fri, 21 Sep 2018 15:41:44 +0000 (16:41 +0100)]
net: hns3: Add unlikely for dma_mapping_error check

For dma_mapping_error is unlikely happened, this patch adds unlikely for
dma_mapping_error check.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Add nic state check before calling netif_tx_wake_queue
Jian Shen [Fri, 21 Sep 2018 15:41:43 +0000 (16:41 +0100)]
net: hns3: Add nic state check before calling netif_tx_wake_queue

When nic down, it firstly calls netif_tx_stop_all_queues(), then calls
napi_disable(). But napi_disable() will wait current napi_poll finish,
it may call netif_tx_wake_queue(). This patch fixes it by add nic state
checking.

Fixes: 87d50ed74a36 ("net: hns3: Unified HNS3 {VF|PF} Ethernet Driver for hip08 SoC")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Add handle for default case
Jian Shen [Fri, 21 Sep 2018 15:41:42 +0000 (16:41 +0100)]
net: hns3: Add handle for default case

There are a few "switch-case" codes missed handle for default case. For
some abnormal case, it should return error code instead of return 0.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Unify the prefix of vf functions
Jian Shen [Fri, 21 Sep 2018 15:41:41 +0000 (16:41 +0100)]
net: hns3: Unify the prefix of vf functions

The prefix of most functions for vf are hclgevf. This patch renames the
function with inconsistent prefix.

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix tqp array traversal condition for vf
Jian Shen [Fri, 21 Sep 2018 15:41:40 +0000 (16:41 +0100)]
net: hns3: Fix tqp array traversal condition for vf

There are two tqp_num variables "hdev->tqp_num" and "kinfo->tqp_num"
used in VF. "hdev->tqp_num" is the total tqp number allocated to the
VF, and "kinfo->tqp_num" indicates the tqp number being used by the
VF. Usually the two variables are equal. But for the case hdev->tqp_num
larger than rss_size_max, and num_tc is 1, "kinfo->tqp_num" will be
less than "hdev->tqp_num".

In original codes, "hdev->tqp_num" is always used to traverse the
tqp array of kinfo. It may cause null pointer error when "hdev->tqp_num"
is larger than "kinfo->tqp_num"

Fixes: efa29b6d502b ("net: hns3: Add HNS3 VF HCL(Hardware Compatibility Layer) Support")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Adjust prefix of tx/rx statistic names
Jian Shen [Fri, 21 Sep 2018 15:41:39 +0000 (16:41 +0100)]
net: hns3: Adjust prefix of tx/rx statistic names

Some prefix of tx/rx statistic names are redundant, this patch modifies
these names.

The new prefix looks like below:
rxq#1_ -> rxq1_
txq#1_ -> txq1_
tx_dropped -> dropped
tx_wake -> wake
tx_busy -> busy
rx_dropped -> dropped

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Unify the type convert for desc.data
Jian Shen [Fri, 21 Sep 2018 15:41:38 +0000 (16:41 +0100)]
net: hns3: Unify the type convert for desc.data

For desc.data is already point to the address of struct member "data[6]",
it's unnecessary to use '&' to get its address. This patch unifies all
the type convert for dest.data, using "req = (struct name *)dest.data".

Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
6 years agonet: hns3: Fix ets validate issue
Jian Shen [Fri, 21 Sep 2018 15:41:37 +0000 (16:41 +0100)]
net: hns3: Fix ets validate issue

There is a defect in hclge_ets_validate(). If each member of tc_tsa is
not IEEE_8021QAZ_TSA_ETS, the variable total_ets_bw won't be updated.
In this case, the check for value of total_ets_bw will fail. This patch
fixes it by checking total_ets_bw only after it has been updated.

Fixes: f5bf9f59ecf3 ("net: hns3: Add hclge_dcb module for the support of DCB feature")
Signed-off-by: Jian Shen <shenjian15@huawei.com>
Signed-off-by: Peng Li <lipeng321@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>