]> git.baikalelectronics.ru Git - kernel.git/log
kernel.git
5 years agoNFSv4.1: Bump the default callback session slot count to 16
Trond Myklebust [Sat, 2 Mar 2019 15:22:56 +0000 (10:22 -0500)]
NFSv4.1: Bump the default callback session slot count to 16

Users can still control this value explicitly using the
max_session_cb_slots module parameter, but let's bump the default
up to 16 for now.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Convert remaining GFP_NOIO, and GFP_NOWAIT sites in sunrpc
Trond Myklebust [Sat, 2 Mar 2019 15:14:02 +0000 (10:14 -0500)]
SUNRPC: Convert remaining GFP_NOIO, and GFP_NOWAIT sites in sunrpc

Convert the remaining gfp_flags arguments in sunrpc to standard reclaiming
allocations, now that we set memalloc_nofs_save() as appropriate.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Clean up mirror DS initialisation
Trond Myklebust [Thu, 28 Feb 2019 16:01:05 +0000 (11:01 -0500)]
NFS/flexfiles: Clean up mirror DS initialisation

Get rid of the redundant parameter and rename the function
ff_layout_mirror_valid() to ff_layout_init_mirror_ds() for clarity.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Remove dead code in ff_layout_mirror_valid()
Trond Myklebust [Thu, 28 Feb 2019 16:35:33 +0000 (11:35 -0500)]
NFS/flexfiles: Remove dead code in ff_layout_mirror_valid()

nfs4_ff_alloc_deviceid_node() guarantees that if mirror->mirror_ds is
a valid pointer, then so is mirror->mirror_ds->ds.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfile: Simplify nfs4_ff_layout_select_ds_stateid()
Trond Myklebust [Thu, 28 Feb 2019 15:49:11 +0000 (10:49 -0500)]
NFS/flexfile: Simplify nfs4_ff_layout_select_ds_stateid()

Pass in a pointer to the mirror rather than forcing another
array access.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfile: Simplify nfs4_ff_layout_ds_version()
Trond Myklebust [Thu, 28 Feb 2019 15:44:15 +0000 (10:44 -0500)]
NFS/flexfile: Simplify nfs4_ff_layout_ds_version()

Pass in a pointer to the mirror rather than forcing another
array access.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Simplify ff_layout_get_ds_cred()
Trond Myklebust [Thu, 28 Feb 2019 15:41:57 +0000 (10:41 -0500)]
NFS/flexfiles: Simplify ff_layout_get_ds_cred()

Pass in a pointer to the mirror rather than forcing another
array access.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Simplify nfs4_ff_find_or_create_ds_client()
Trond Myklebust [Thu, 28 Feb 2019 15:38:41 +0000 (10:38 -0500)]
NFS/flexfiles: Simplify nfs4_ff_find_or_create_ds_client()

Pass in a pointer to the mirror rather than forcing another
array access.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Simplify nfs4_ff_layout_select_ds_fh()
Trond Myklebust [Thu, 28 Feb 2019 15:34:13 +0000 (10:34 -0500)]
NFS/flexfiles: Simplify nfs4_ff_layout_select_ds_fh()

Pass in a pointer to the mirror rather than having to retrieve it from
the array and then verify the resulting pointer.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Speed up read failover when DSes are down
Trond Myklebust [Thu, 14 Feb 2019 22:32:40 +0000 (17:32 -0500)]
NFS/flexfiles: Speed up read failover when DSes are down

If we notice that a DS may be down, we should attempt to read from the
other mirrors first before we go back to retry the dead DS.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Don't invalidate DS deviceids for being unresponsive
Trond Myklebust [Tue, 26 Feb 2019 18:52:50 +0000 (13:52 -0500)]
NFS/flexfiles: Don't invalidate DS deviceids for being unresponsive

If the DS is unresponsive, we want to just mark it as such, while
reporting the errors. If the server later returns the same deviceid
in a new layout, then we don't want to have to look it up again.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Remove bogus checks for invalid deviceids
Trond Myklebust [Tue, 26 Feb 2019 17:57:06 +0000 (12:57 -0500)]
NFS/flexfiles: Remove bogus checks for invalid deviceids

We already check the deviceids before we start the RPC call.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Avoid unnecessary layout invalidations
Trond Myklebust [Thu, 28 Feb 2019 04:46:04 +0000 (23:46 -0500)]
NFS/flexfiles: Avoid unnecessary layout invalidations

In ff_layout_mirror_valid() we may not want to invalidate the layout
segment despite the call to GETDEVICEINFO failing. The reason is that
a read may still be able to make progress on another mirror.

So instead we let the caller (in this case nfs4_ff_layout_prepare_ds())
decide whether or not it needs to invalidate.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: refactor calls to fs4_ff_layout_prepare_ds()
Trond Myklebust [Thu, 14 Feb 2019 18:45:45 +0000 (13:45 -0500)]
NFS/flexfiles: refactor calls to fs4_ff_layout_prepare_ds()

While we may want to skip attempting to connect to a downed mirror
when we're deciding which mirror to select for a read, we do not
want to do so once we've committed to attempting the I/O in
ff_layout_read/write_pagelist(), or ff_layout_initiate_commit()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4: Handle early exit in layoutget by returning an error
Trond Myklebust [Wed, 13 Feb 2019 12:55:31 +0000 (07:55 -0500)]
NFSv4: Handle early exit in layoutget by returning an error

If the LAYOUTGET rpc call exits early without an error, convert it to
EAGAIN.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Send LAYOUTERROR when failing over mirrored reads
Trond Myklebust [Mon, 11 Feb 2019 03:38:43 +0000 (22:38 -0500)]
NFS/flexfiles: Send LAYOUTERROR when failing over mirrored reads

When a read to the preferred mirror returns an error, the flexfiles
driver records the error in the inode list and currently marks the
layout for return before failing over the attempted read to the next
mirror.
What we actually want to do is fire off a LAYOUTERROR to notify the
MDS that there is an issue with the preferred mirror, then we fail
over. Only once we've failed to read from all mirrors should we
return the layout.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4.2: Add client support for the generic 'layouterror' RPC call
Trond Myklebust [Fri, 8 Feb 2019 15:31:05 +0000 (10:31 -0500)]
NFSv4.2: Add client support for the generic 'layouterror' RPC call

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4/flexfiles: Abort I/O early if the layout segment was invalidated
Trond Myklebust [Wed, 27 Feb 2019 20:37:36 +0000 (15:37 -0500)]
NFSv4/flexfiles: Abort I/O early if the layout segment was invalidated

If a layout segment gets invalidated while a pNFS I/O operation
is queued for transmission, then we ideally want to abort
immediately. This is particularly the case when there is a large
number of I/O related RPCs queued in the RPC layer, and the layout
segment gets invalidated due to an ENOSPC error, or an EACCES (because
the client was fenced). We may end up forced to spam the MDS with a
lot of otherwise unnecessary LAYOUTERRORs after that I/O fails.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4/pnfs: Fix barriers in nfs4_mark_deviceid_unavailable()
Trond Myklebust [Tue, 26 Feb 2019 19:27:58 +0000 (14:27 -0500)]
NFSv4/pnfs: Fix barriers in nfs4_mark_deviceid_unavailable()

Fix the memory barriers in nfs4_mark_deviceid_unavailable() and
nfs4_test_deviceid_unavailable().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/flexfiles: Fix up sparse RCU annotations
Trond Myklebust [Tue, 26 Feb 2019 16:56:12 +0000 (11:56 -0500)]
NFS/flexfiles: Fix up sparse RCU annotations

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE()
Trond Myklebust [Tue, 26 Feb 2019 16:19:46 +0000 (11:19 -0500)]
NFSv4/flexfiles: Fix invalid deref in FF_LAYOUT_DEVID_NODE()

If the attempt to instantiate the mirror's layout DS pointer failed,
then that pointer may hold a value of type ERR_PTR(), so we need
to check that before we dereference it.

Fixes: cbe4f330b5e0c ("pNFS/flexfiles: Fix a deadlock on LAYOUTGET")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Add missing encode / decode sequence_maxsz to v4.2 operations
Anna Schumaker [Fri, 1 Mar 2019 21:09:56 +0000 (16:09 -0500)]
NFS: Add missing encode / decode sequence_maxsz to v4.2 operations

These really should have been there from the beginning, but we never
noticed because there was enough slack in the RPC request for the extra
bytes. Chuck's recent patch to use au_cslack and au_rslack to compute
buffer size shrunk the buffer enough that this was now a problem for
SEEK operations on my test client.

Fixes: c0f162a03c425 ("nfs: Add ALLOCATE support")
Fixes: 9df45077b2b3c ("NFS: Add COPY nfs operation")
Fixes: 6e9af9b3ff9d8 ("NFS OFFLOAD_CANCEL xdr")
Fixes: 43f2ec10e06e3 ("nfs: Add DEALLOCATE support")
Fixes: 1454bf3d0ae27 ("NFS: Implement SEEK")
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4.1: Don't process the sequence op more than once.
Trond Myklebust [Fri, 1 Mar 2019 16:40:05 +0000 (11:40 -0500)]
NFSv4.1: Don't process the sequence op more than once.

Ensure that if we call nfs41_sequence_process() a second time for the
same rpc_task, then we only process the results once.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4.1: Reinitialise sequence results before retransmitting a request
Trond Myklebust [Fri, 1 Mar 2019 17:13:34 +0000 (12:13 -0500)]
NFSv4.1: Reinitialise sequence results before retransmitting a request

If we have to retransmit a request, we should ensure that we reinitialise
the sequence results structure, since in the event of a signal
we need to treat the request as if it had not been sent.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org
5 years agoSUNRPC: Fix an Oops in udp_poll()
Trond Myklebust [Tue, 26 Feb 2019 11:33:02 +0000 (06:33 -0500)]
SUNRPC: Fix an Oops in udp_poll()

udp_poll() checks the struct file for the O_NONBLOCK flag, so we must not
call it with a NULL file pointer.

Fixes: 5dea83578118 ("SUNRPC: Use poll() to fix up the socket requeue races")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoMerge tag 'nfs-rdma-for-5.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs
Trond Myklebust [Mon, 25 Feb 2019 13:39:26 +0000 (08:39 -0500)]
Merge tag 'nfs-rdma-for-5.1-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFSoRDMA client updates for 5.1

New features:
- Convert rpc auth layer to use xdr_streams
- Config option to disable insecure enctypes
- Reduce size of RPC receive buffers

Bugfixes and cleanups:
- Fix sparse warnings
- Check inline size before providing a write chunk
- Reduce the receive doorbell rate
- Various tracepoint improvements

[Trond: Fix up merge conflicts]
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umount
Trond Myklebust [Fri, 22 Feb 2019 19:20:27 +0000 (14:20 -0500)]
NFS/pnfs: Bulk destroy of layouts needs to be safe w.r.t. umount

If a bulk layout recall or a metadata server reboot coincides with a
umount, then holding a reference to an inode is unsafe unless we
also hold a reference to the super block.

Fixes: 7db4f5e4b0688 ("NFSv4.1: Fix bulk recall and destroy of layouts")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Fix a soft lockup in the delegation recovery code
Trond Myklebust [Thu, 21 Feb 2019 19:51:25 +0000 (14:51 -0500)]
NFS: Fix a soft lockup in the delegation recovery code

Fix a soft lockup when NFS client delegation recovery is attempted
but the inode is in the process of being freed. When the
igrab(inode) call fails, and we have to restart the recovery process,
we need to ensure that we won't attempt to recover the same delegation
again.

Fixes: 3456223d216b1 ("NFSv4.1: Test delegation stateids when server...")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFSv4.1: Avoid false retries when RPC calls are interrupted
Trond Myklebust [Wed, 20 Jun 2018 21:53:34 +0000 (17:53 -0400)]
NFSv4.1: Avoid false retries when RPC calls are interrupted

A 'false retry' in NFSv4.1 occurs when the client attempts to transmit a
new RPC call using a slot+sequence number combination that references an
already cached one. Currently, the Linux NFS client will do this if a
user process interrupts an RPC call that is in progress.
The problem with doing so is that we defeat the main mechanism used by
the server to differentiate between a new call and a replayed one. Even
if the server is able to perfectly cache the arguments of the old call,
it cannot know if the client intended to replay or send a new call.

The obvious fix is to bump the sequence number pre-emptively if an
RPC call is interrupted, but in order to deal with the corner cases
where the interrupted call is not actually received and processed by
the server, we need to interpret the error NFS4ERR_SEQ_MISORDERED
as a sign that we need to either wait or locate a correct sequence
number that lies between the value we sent, and the last value that
was acked by a SEQUENCE call on that slot.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Tested-by: Jason Tibbitts <tibbs@math.uh.edu>
5 years agoSUNRPC: Remove the redundant 'zerocopy' argument to xs_sendpages()
Trond Myklebust [Tue, 19 Feb 2019 18:21:38 +0000 (13:21 -0500)]
SUNRPC: Remove the redundant 'zerocopy' argument to xs_sendpages()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Further cleanups of xs_sendpages()
Trond Myklebust [Tue, 19 Feb 2019 18:13:40 +0000 (13:13 -0500)]
SUNRPC: Further cleanups of xs_sendpages()

Now that we send the pages using a struct msghdr, instead of
using sendpage(), we no longer need to 'prime the socket' with
an address for unconnected UDP messages.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Convert socket page send code to use iov_iter()
Trond Myklebust [Tue, 19 Feb 2019 18:00:13 +0000 (13:00 -0500)]
SUNRPC: Convert socket page send code to use iov_iter()

Simplify the page send code using iov_iter and bvecs.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()
Trond Myklebust [Tue, 19 Feb 2019 17:39:40 +0000 (12:39 -0500)]
SUNRPC: Convert xs_send_kvec() to use iov_iter_kvec()

Prepare to the socket transmission code to use iov_iter.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Initiate a connection close on an ESHUTDOWN error in stream receive
Trond Myklebust [Sat, 16 Feb 2019 14:31:47 +0000 (09:31 -0500)]
SUNRPC: Initiate a connection close on an ESHUTDOWN error in stream receive

If the client stream receive code receives an ESHUTDOWN error either
because the server closed the connection, or because it sent a
callback which cannot be processed, then we should shut down
the connection.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Don't suppress socket errors when a message read completes
Trond Myklebust [Fri, 15 Feb 2019 21:53:04 +0000 (16:53 -0500)]
SUNRPC: Don't suppress socket errors when a message read completes

If the message read completes, but the socket returned an error
condition, we should ensure to propagate that error.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Handle zero length fragments correctly
Trond Myklebust [Fri, 15 Feb 2019 23:14:27 +0000 (18:14 -0500)]
SUNRPC: Handle zero length fragments correctly

A zero length fragment is really a bug, but let's ensure we don't
go nuts when one turns up.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Don't reset the stream record info when the receive worker is running
Trond Myklebust [Wed, 20 Feb 2019 19:56:20 +0000 (14:56 -0500)]
SUNRPC: Don't reset the stream record info when the receive worker is running

To ensure that the receive worker has exclusive access to the stream record
info, we must not reset the contents other than when holding the
transport->recv_mutex.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agonfs: fix xfstest generic/099 failed on nfsv3
ZhangXiaoxu [Mon, 18 Feb 2019 14:56:43 +0000 (22:56 +0800)]
nfs: fix xfstest generic/099 failed on nfsv3

After setxattr, the nfsv3 cached the acl which set by user.

But at the backend, the shared file system (eg. ext4) will check
the acl, if it can merged with mode, it won't add acl to the file.
So, the nfsv3 cached acl is redundant.

Don't 'set_cached_acl' when setxattr.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agopNFS: Avoid read/modify/write when it is not necessary
Kazuo Ito [Thu, 14 Feb 2019 09:39:03 +0000 (18:39 +0900)]
pNFS: Avoid read/modify/write when it is not necessary

As the block and SCSI layouts can only read/write fixed-length
blocks, we must perform read-modify-write when data to be written is
not aligned to a block boundary or smaller than the block size.
(0aca8d3894f79 pnfs: add flag to force read-modify-write in ->write_begin)

The current code tries to see if we have to do read-modify-write
on block-oriented pNFS layouts by just checking !PageUptodate(page),
but the same condition also applies for overwriting of any uncached
potions of existing files, making such operations excessively slow
even it is block-aligned.

The change does not affect the optimization for modify-write-read
cases (548f3e70985b0 NFS: read-modify-write page updating),
because partial update of !PageUptodate() pages can only happen
in layouts that can do arbitrary length read/write and never
in block-based ones.

Testing results:

We ran fio on one of the pNFS clients running 4.20 kernel
(vanilla and patched) in this configuration to read/write/overwrite
files on the storage array, exported as pnfs share by the server.

 pNFS clients ---1G Ethernet--- pNFS server
 (HP DL360 G8)                  (HP DL360 G8)
       |                              |
       |                              |
       +------8G Fiber Channel--------+
                     |
               Storage Array
                 (HP P6350)

Throughput of overwrite (both buffered and O_SYNC) is noticeably
improved.

Ops.     |block size|   Throughput   |
         |  (KiB)   |    (MiB/s)     |
         |          |  4.20 | patched|
---------+----------+----------------+
buffered |         4|  21.3 |  232   |
overwrite|        32|  22.2 |  256   |
         |       512|  22.4 |  260   |
---------+----------+----------------+
O_SYNC   |         4|   3.84|    4.77|
overwrite|        32|  12.2 |   32.0 |
         |       512|  18.5 |  152   |
---------+----------+----------------+

Read and write (buffered and O_SYNC) by the same client remain unchanged
by the patch either negatively or positively, as they should do.

Ops.     |block size|   Throughput   |
         |  (KiB)   |    (MiB/s)     |
         |          |  4.20 | patched|
---------+----------+----------------+
read     |         4| 548   |  550   |
         |        32| 547   |  551   |
         |       512| 548   |  551   |
---------+----------+----------------+
buffered |         4| 237   |  244   |
write    |        32| 261   |  268   |
         |       512| 265   |  272   |
---------+----------+----------------+
O_SYNC   |         4|   0.46|    0.46|
write    |        32|   3.60|    3.57|
         |       512| 105   |  106   |
---------+----------+----------------+

Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Tested-by: Hiroyuki Watanabe <watanabe.hiroyuki@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agopNFS: Fix potential corruption of page being written
Kazuo Ito [Thu, 14 Feb 2019 09:36:58 +0000 (18:36 +0900)]
pNFS: Fix potential corruption of page being written

nfs_want_read_modify_write() didn't check for !PagePrivate when pNFS
block or SCSI layout was in use, therefore we could lose data forever
if the page being written was filled by a read before completion.

Signed-off-by: Kazuo Ito <ito_kazuo_g3@lab.ntt.co.jp>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Fix typo in comments of nfs_readdir_alloc_pages()
zhangliguang [Sat, 16 Feb 2019 00:38:40 +0000 (08:38 +0800)]
NFS: Fix typo in comments of nfs_readdir_alloc_pages()

This fixes the typo in comments of nfs_readdir_alloc_pages().
Because nfs_readdir_large_page and nfs_readdir_free_pagearray had been
renamed.

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Remove redundant semicolon
zhangliguang [Tue, 12 Feb 2019 01:38:33 +0000 (09:38 +0800)]
NFS: Remove redundant semicolon

This removes redundant semicolon for ending code.

Fixes: 976312181507 ("NFSv4: Fix lookup revalidate of regular files")
Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: readdirplus optimization by cache mechanism
luanshi [Tue, 29 Jan 2019 07:34:17 +0000 (15:34 +0800)]
NFS: readdirplus optimization by cache mechanism

When listing very large directories via NFS, clients may take a long
time to complete. There are about three factors involved:

First of all, ls and practically every other method of listing a
directory including python os.listdir and find rely on libc readdir().
However readdir() only reads 32K of directory entries at a time, which
means that if you have a lot of files in the same directory, it is going
to take an insanely long time to read all the directory entries.

Secondly, libc readdir() reads 32K of directory entries at a time, in
kernel space 32K buffer split into 8 pages. One NFS readdirplus rpc will
be called for one page, which introduces many readdirplus rpc calls.

Lastly, one NFS readdirplus rpc asks for 32K data (filled by nfs_dentry)
to fill one page (filled by dentry), we found that nearly one third of
data was wasted.

To solve above problems, pagecache mechanism was introduced. One NFS
readdirplus rpc will ask for a large data (more than 32k), the data can
fill more than one page, the cached pages can be used for next readdir
call. This can reduce many readdirplus rpc calls and improve readdirplus
performance.

TESTING:
When listing very large directories(include 300 thousand files) via NFS

time ls -l /nfs_mount | wc -l

without the patch:
300001
real    1m53.524s
user    0m2.314s
sys     0m2.599s

with the patch:
300001
real    0m23.487s
user    0m2.305s
sys     0m2.558s

Improved performance: 79.6%
readdirplus rpc calls decrease: 85%

Signed-off-by: Liguang Zhang <zhangliguang@linux.alibaba.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agofs/nfs: Fix nfs_parse_devname to not modify it's argument
Eric W. Biederman [Wed, 30 Jan 2019 13:58:38 +0000 (07:58 -0600)]
fs/nfs: Fix nfs_parse_devname to not modify it's argument

In the rare and unsupported case of a hostname list nfs_parse_devname
will modify dev_name.  There is no need to modify dev_name as the all
that is being computed is the length of the hostname, so the computed
length can just be shorted.

Fixes: 5bb8bfece35b ("NFS: Use common device name parsing logic for NFSv4 and NFSv2/v3")
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: remove pointless test in unx_match()
NeilBrown [Mon, 7 Jan 2019 06:53:52 +0000 (17:53 +1100)]
SUNRPC: remove pointless test in unx_match()

As reported by Dan Carpenter, this test for acred->cred being set is
inconsistent with the dereference of the pointer a few lines earlier.

An 'auth_cred' *always* has ->cred set - every place that creates one
initializes this field, often as the first thing done.

So remove this test.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: drop useless LIST_HEAD
Julia Lawall [Sun, 23 Dec 2018 08:57:10 +0000 (09:57 +0100)]
NFS: drop useless LIST_HEAD

Drop LIST_HEAD where the variable it declares has never
been used.

The semantic patch that fixes this problem is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
identifier x;
@@
- LIST_HEAD(x);
  ... when != x
// </smpl>

Fixes: e24c2b78aa0a ("NFSv4.1 Use MDS auth flavor for data server connection")
Signed-off-by: Julia Lawall <Julia.Lawall@lip6.fr>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Ensure rq_bytes_sent is reset before request transmission
Trond Myklebust [Wed, 2 Jan 2019 20:54:42 +0000 (15:54 -0500)]
SUNRPC: Ensure rq_bytes_sent is reset before request transmission

When we resend a request, ensure that the 'rq_bytes_sent' is reset
to zero.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Use poll() to fix up the socket requeue races
Trond Myklebust [Wed, 30 Jan 2019 19:51:26 +0000 (14:51 -0500)]
SUNRPC: Use poll() to fix up the socket requeue races

Because we clear XPRT_SOCK_DATA_READY before reading, we can end up
with a situation where new data arrives, causing xs_data_ready() to
queue up a second receive worker job for the same socket, which then
immediately gets stuck waiting on the transport receive mutex.
The fix is to only clear XPRT_SOCK_DATA_READY once we're done reading,
and then to use poll() to check if we might need to queue up a new
job in order to deal with any new data.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoSUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs
Trond Myklebust [Mon, 18 Feb 2019 15:02:29 +0000 (10:02 -0500)]
SUNRPC: Set memalloc_nofs_save() on all rpciod/xprtiod jobs

Set memalloc_nofs_save() on all the rpciod/xprtiod jobs so that we
ensure memory allocations for asynchronous rpc calls don't ever end
up recursing back to the NFS layer for memory reclaim.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Fix sparse annotations for nfs_set_open_stateid_locked()
Trond Myklebust [Tue, 22 Jan 2019 19:01:16 +0000 (14:01 -0500)]
NFS: Fix sparse annotations for nfs_set_open_stateid_locked()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Fix up documentation warnings
Trond Myklebust [Mon, 18 Feb 2019 18:32:38 +0000 (13:32 -0500)]
NFS: Fix up documentation warnings

Fix up some compiler warnings about function parameters, etc not being
correctly described or formatted.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: ENOMEM should also be a fatal error.
Trond Myklebust [Wed, 13 Feb 2019 13:29:27 +0000 (08:29 -0500)]
NFS: ENOMEM should also be a fatal error.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: EINTR is also a fatal error.
Trond Myklebust [Tue, 22 Jan 2019 12:39:09 +0000 (07:39 -0500)]
NFS: EINTR is also a fatal error.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Ensure NFS writeback allocations don't recurse back into NFS.
Trond Myklebust [Mon, 18 Feb 2019 18:06:54 +0000 (13:06 -0500)]
NFS: Ensure NFS writeback allocations don't recurse back into NFS.

All the allocations that we can hit in the NFS layer and sunrpc layers
themselves are already marked as GFP_NOFS, but we need to ensure that
any calls to generic kernel functionality do the right thing as well.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Pass error information to the pgio error cleanup routine
Trond Myklebust [Wed, 13 Feb 2019 15:39:39 +0000 (10:39 -0500)]
NFS: Pass error information to the pgio error cleanup routine

Allow the caller to pass error information when cleaning up a failed
I/O request so that we can conditionally take action to cancel the
request altogether if the error turned out to be fatal.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Clean up list moves of struct nfs_page
Trond Myklebust [Mon, 18 Feb 2019 16:35:54 +0000 (11:35 -0500)]
NFS: Clean up list moves of struct nfs_page

In several places we're just moving the struct nfs_page from one list to
another by first removing from the existing list, then adding to the new
one.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
5 years agoNFS: Don't recoalesce on error in nfs_pageio_complete_mirror()
Trond Myklebust [Fri, 15 Feb 2019 21:08:25 +0000 (16:08 -0500)]
NFS: Don't recoalesce on error in nfs_pageio_complete_mirror()

If the I/O completion failed with a fatal error, then we should just
exit nfs_pageio_complete_mirror() rather than try to recoalesce.

Fixes: d13b63f1976b ("nfs: add mirroring support to pgio layer")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
5 years agoNFS: Fix an I/O request leakage in nfs_do_recoalesce
Trond Myklebust [Fri, 15 Feb 2019 19:59:52 +0000 (14:59 -0500)]
NFS: Fix an I/O request leakage in nfs_do_recoalesce

Whether we need to exit early, or just reprocess the list, we
must not lost track of the request which failed to get recoalesced.

Fixes: 258d62d34e27 ("NFS: Fix a memory leak in nfs_do_recoalesce")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Cc: stable@vger.kernel.org # v4.0+
5 years agoNFS: Fix I/O request leakages
Trond Myklebust [Wed, 13 Feb 2019 14:21:38 +0000 (09:21 -0500)]
NFS: Fix I/O request leakages

When we fail to add the request to the I/O queue, we currently leave it
to the caller to free the failed request. However since some of the
requests that fail are actually created by nfs_pageio_add_request()
itself, and are not passed back the caller, this leads to a leakage
issue, which can again cause page locks to leak.

This commit addresses the leakage by freeing the created requests on
error, using desc->pg_completion_ops->error_cleanup()

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Fixes: d13b63f1976b6 ("nfs: add mirroring support to pgio layer")
Cc: stable@vger.kernel.org # v4.0: f6e2b2846d96: nfs: clean up rest of reqs
Cc: stable@vger.kernel.org # v4.0: 5ee7b765cdda: NFS41: pop some layoutget
Cc: stable@vger.kernel.org # v4.0+
5 years agoMerge tag 'sound-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
Linus Torvalds [Wed, 20 Feb 2019 17:42:52 +0000 (09:42 -0800)]
Merge tag 'sound-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound

Pull sound fixes from Takashi Iwai:
 "Here are a few last-minute fixes for 5.0.

  The most significant one is the OF-node refcount fix for ASoC
  simple-card, which could be triggered on many boards. Another fix for
  ASoC core is for the error handling in topology, while others are
  device-specific fixes for Samsung and HD-audio"

* tag 'sound-5.0' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound:
  ASoC: simple-card: fixup refcount_t underflow
  ASoC: topology: free created components in tplg load error
  ALSA: hda/realtek: Disable PC beep in passthrough on alc285
  ALSA: hda/realtek - Headset microphone and internal speaker support for System76 oryp5
  ASoC: samsung: i2s: Fix prescaler setting for the secondary DAI

5 years agoMerge tag 'pinctrl-v5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw...
Linus Torvalds [Wed, 20 Feb 2019 17:39:53 +0000 (09:39 -0800)]
Merge tag 'pinctrl-v5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl

Pull pin control fixes from Linus Walleij:
 "Some final pin control fixes (I hope) to round off the v5.0 pin
  control development cycle.

  Only driver fixes, one for stable:

   - Meson8B fixup for the sdc pins

   - Fix SDC tile position for Qualcomm QCS404"

* tag 'pinctrl-v5.0-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
  pinctrl: qcom: qcs404: Correct SDC tile

5 years agoMerge tag 'gpio-v5.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux...
Linus Torvalds [Wed, 20 Feb 2019 17:36:33 +0000 (09:36 -0800)]
Merge tag 'gpio-v5.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio

Pull GPIO fixes from Linus Walleij:
 "Two GPIO fixes for the v5.0 series:

   - Per-instance irqchip on the MT7621

   - Avoid direction setting using pin control on MMP2"

* tag 'gpio-v5.0-4' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
  gpio: pxa: avoid attempting to set pin direction via pinctrl on MMP2
  gpio: MT7621: use a per instance irq_chip structure

5 years agoMerge tag 'mtd/fixes-for-5.0-rc8' of git://git.infradead.org/linux-mtd
Linus Torvalds [Wed, 20 Feb 2019 17:16:11 +0000 (09:16 -0800)]
Merge tag 'mtd/fixes-for-5.0-rc8' of git://git.infradead.org/linux-mtd

Pull MTD fixes from Boris Brezillon:

 - Don't add a digit to MTD-backed nvmem device names

 - Make sure powernv flash names are unique

* tag 'mtd/fixes-for-5.0-rc8' of git://git.infradead.org/linux-mtd:
  mtd: powernv_flash: Fix device registration error
  mtd: Use mtd->name when registering nvmem device

5 years agoMerge branch 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorri...
Linus Torvalds [Wed, 20 Feb 2019 17:09:33 +0000 (09:09 -0800)]
Merge branch 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security

Pull keys fixes from James Morris:

 - Handle quotas better, allowing full quota to be reached.

 - Fix the creation of shortcuts in the assoc_array internal
   representation when the index key needs to be an exact multiple of
   the machine word size.

 - Fix a dependency loop between the request_key contruction record and
   the request_key authentication key. The construction record isn't
   really necessary and can be dispensed with.

 - Set the timestamp on a new key rather than leaving it as 0. This
   would ordinarily be fine - provided the system clock is never set to
   a time before 1970

* 'fixes-v5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/linux-security:
  keys: Timestamp new keys
  keys: Fix dependency loop between construction record and auth key
  assoc_array: Fix shortcut creation
  KEYS: allow reaching the keys quotas exactly

5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Linus Torvalds [Wed, 20 Feb 2019 00:13:19 +0000 (16:13 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net

Pull networking fixes from David Miller:

 1) Fix suspend and resume in mt76x0u USB driver, from Stanislaw
    Gruszka.

 2) Missing memory barriers in xsk, from Magnus Karlsson.

 3) rhashtable fixes in mac80211 from Herbert Xu.

 4) 32-bit MIPS eBPF JIT fixes from Paul Burton.

 5) Fix for_each_netdev_feature() on big endian, from Hauke Mehrtens.

 6) GSO validation fixes from Willem de Bruijn.

 7) Endianness fix for dwmac4 timestamp handling, from Alexandre Torgue.

 8) More strict checks in tcp_v4_err(), from Eric Dumazet.

 9) af_alg_release should NULL out the sk after the sock_put(), from Mao
    Wenan.

10) Missing unlock in mac80211 mesh error path, from Wei Yongjun.

11) Missing device put in hns driver, from Salil Mehta.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (44 commits)
  sky2: Increase D3 delay again
  vhost: correctly check the return value of translate_desc() in log_used()
  net: netcp: Fix ethss driver probe issue
  net: hns: Fixes the missing put_device in positive leg for roce reset
  net: stmmac: Fix a race in EEE enable callback
  qed: Fix iWARP syn packet mac address validation.
  qed: Fix iWARP buffer size provided for syn packet processing.
  r8152: Add support for MAC address pass through on RTL8153-BD
  mac80211: mesh: fix missing unlock on error in table_path_del()
  net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
  net: crypto set sk to NULL when af_alg_release.
  net: Do not allocate page fragments that are not skb aligned
  mm: Use fixed constant in page_frag_alloc instead of size + 1
  tcp: tcp_v4_err() should be more careful
  tcp: clear icsk_backoff in tcp_write_queue_purge()
  net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
  qmi_wwan: apply SET_DTR quirk to Sierra WP7607
  net: stmmac: handle endianness in dwmac4_get_timestamp
  doc: Mention MSG_ZEROCOPY implementation for UDP
  mlxsw: __mlxsw_sp_port_headroom_set(): Fix a use of local variable
  ...

5 years agosky2: Increase D3 delay again
Kai-Heng Feng [Tue, 19 Feb 2019 15:45:29 +0000 (23:45 +0800)]
sky2: Increase D3 delay again

Another platform requires even longer delay to make the device work
correctly after S3.

So increase the delay to 300ms.

BugLink: https://bugs.launchpad.net/bugs/1798921
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agovhost: correctly check the return value of translate_desc() in log_used()
Jason Wang [Tue, 19 Feb 2019 06:53:44 +0000 (14:53 +0800)]
vhost: correctly check the return value of translate_desc() in log_used()

When fail, translate_desc() returns negative value, otherwise the
number of iovs. So we should fail when the return value is negative
instead of a blindly check against zero.

Detected by CoverityScan, CID# 1442593:  Control flow issues  (DEADCODE)

Fixes: 97d763512e67 ("vhost: log dirty page correctly")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Reported-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'asoc-fix-v5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/brooni...
Takashi Iwai [Tue, 19 Feb 2019 11:35:55 +0000 (12:35 +0100)]
Merge tag 'asoc-fix-v5.0-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus

ASoC: Fixes for v5.0

A few small fixes, a driver fix for Samsung, a fix for refcounting of
of_nodes in the simple-card driver that triggered on a lot of systems
and a fix for topology error handling.

5 years agoMerge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
David S. Miller [Tue, 19 Feb 2019 01:56:30 +0000 (17:56 -0800)]
Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf

Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for net:

1) Follow up patch to fix a compilation warning in a recent IPVS fix:
   0e46de870636 ("ipvs: fix dependency on nf_defrag_ipv6").

2) Bogus ENOENT error on flush after rule deletion in the same batch,
   reported by Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: netcp: Fix ethss driver probe issue
Murali Karicheri [Mon, 18 Feb 2019 20:10:51 +0000 (15:10 -0500)]
net: netcp: Fix ethss driver probe issue

Recent commit below has introduced a bug in netcp driver that causes
the ethss driver probe failure and thus break the networking function
on K2 SoCs such as K2HK, K2L, K2E etc. This patch fixes the issue to
restore networking on the above SoCs.

Fixes: 8a6309c0c2b0 ("net: ethernet: Convert to using %pOFn instead of device_node.name")
Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: hns: Fixes the missing put_device in positive leg for roce reset
Salil Mehta [Mon, 18 Feb 2019 17:40:32 +0000 (17:40 +0000)]
net: hns: Fixes the missing put_device in positive leg for roce reset

This patch fixes the missing device reference release-after-use in
the positive leg of the roce reset API of the HNS DSAF.

Fixes: 299bb0a32e24 ("net: hns: Fix object reference leaks in hns_dsaf_roce_reset()")
Reported-by: John Garry <john.garry@huawei.com>
Signed-off-by: Salil Mehta <salil.mehta@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge tag 'wireless-drivers-for-davem-2019-02-18' of git://git.kernel.org/pub/scm...
David S. Miller [Tue, 19 Feb 2019 01:40:47 +0000 (17:40 -0800)]
Merge tag 'wireless-drivers-for-davem-2019-02-18' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers

Kalle Valo says:

====================
wireless-drivers fixes for 5.0

Hopefully the last set of fixes for 5.0, only fix this time.

mt76

* fix regression with resume on mt76x0u USB devices
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: stmmac: Fix a race in EEE enable callback
Jose Abreu [Mon, 18 Feb 2019 13:35:03 +0000 (14:35 +0100)]
net: stmmac: Fix a race in EEE enable callback

We are saving the status of EEE even before we try to enable it. This
leads to a race with XMIT function that tries to arm EEE timer before we
set it up.

Fix this by only saving the EEE parameters after all operations are
performed with success.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Fixes: 8c0b11866f6b ("stmmac: add the Energy Efficient Ethernet support")
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'qed-iWARP'
David S. Miller [Tue, 19 Feb 2019 00:51:54 +0000 (16:51 -0800)]
Merge branch 'qed-iWARP'

Michal Kalderon says:

====================
qed: iWARP - fix some syn related issues.

This series fixes two bugs related to iWARP syn processing flow.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed: Fix iWARP syn packet mac address validation.
Michal Kalderon [Mon, 18 Feb 2019 13:24:03 +0000 (15:24 +0200)]
qed: Fix iWARP syn packet mac address validation.

The ll2 forwards all syn packets to the driver without validating the mac
address. Add validation check in the driver's iWARP listener flow and drop
the packet if it isn't intended for the device.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqed: Fix iWARP buffer size provided for syn packet processing.
Michal Kalderon [Mon, 18 Feb 2019 13:24:02 +0000 (15:24 +0200)]
qed: Fix iWARP buffer size provided for syn packet processing.

The assumption that the maximum size of a syn packet is 128 bytes
is wrong. Tunneling headers were not accounted for.
Allocate buffers large enough for mtu.

Signed-off-by: Ariel Elior <ariel.elior@marvell.com>
Signed-off-by: Michal Kalderon <michal.kalderon@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoexec: load_script: Do not exec truncated interpreter path
Kees Cook [Tue, 19 Feb 2019 00:36:48 +0000 (16:36 -0800)]
exec: load_script: Do not exec truncated interpreter path

Commit a5d0c2f73138 ("exec: load_script: don't blindly truncate
shebang string") was trying to protect against a confused exec of a
truncated interpreter path. However, it was overeager and also refused
to truncate arguments as well, which broke userspace, and it was
reverted. This attempts the protection again, but allows arguments to
remain truncated. In an effort to improve readability, helper functions
and comments have been added.

Co-developed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Samuel Dionne-Riel <samuel@dionne-riel.com>
Cc: Richard Weinberger <richard.weinberger@gmail.com>
Cc: Graham Christensen <graham@grahamc.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
5 years agor8152: Add support for MAC address pass through on RTL8153-BD
David Chen [Sat, 16 Feb 2019 09:16:42 +0000 (17:16 +0800)]
r8152: Add support for MAC address pass through on RTL8153-BD

RTL8153-BD is used in Dell DA300 type-C dongle.
It should be added to the whitelist of devices to activate MAC address
pass through.

Per confirming with Realtek all devices containing RTL8153-BD should
activate MAC pass through and there won't use pass through bit on efuse
like in RTL8153-AD.

Signed-off-by: David Chen <david.chen7@dell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomac80211: mesh: fix missing unlock on error in table_path_del()
Wei Yongjun [Mon, 18 Feb 2019 10:29:29 +0000 (11:29 +0100)]
mac80211: mesh: fix missing unlock on error in table_path_del()

spin_lock_bh() is used in table_path_del() but rcu_read_unlock()
is used for unlocking. Fix it by using spin_unlock_bh() instead
of rcu_read_unlock() in the error handling case.

Fixes: bb0cb2d8d4c8 ("mac80211: Use linked list instead of rhashtable walk for mesh tables")
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet/mlx4_en: fix spelling mistake: "quiting" -> "quitting"
Colin Ian King [Sun, 17 Feb 2019 23:03:31 +0000 (23:03 +0000)]
net/mlx4_en: fix spelling mistake: "quiting" -> "quitting"

There is a spelling mistake in a en_err error message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: crypto set sk to NULL when af_alg_release.
Mao Wenan [Mon, 18 Feb 2019 02:44:44 +0000 (10:44 +0800)]
net: crypto set sk to NULL when af_alg_release.

KASAN has found use-after-free in sockfs_setattr.
The existed commit ba70670b00b1 ("socket: close race condition between sock_close()
and sockfs_setattr()") is to fix this simillar issue, but it seems to ignore
that crypto module forgets to set the sk to NULL after af_alg_release.

KASAN report details as below:
BUG: KASAN: use-after-free in sockfs_setattr+0x120/0x150
Write of size 4 at addr ffff88837b956128 by task syz-executor0/4186

CPU: 2 PID: 4186 Comm: syz-executor0 Not tainted xxx + #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.2-1ubuntu1 04/01/2014
Call Trace:
 dump_stack+0xca/0x13e
 print_address_description+0x79/0x330
 ? vprintk_func+0x5e/0xf0
 kasan_report+0x18a/0x2e0
 ? sockfs_setattr+0x120/0x150
 sockfs_setattr+0x120/0x150
 ? sock_register+0x2d0/0x2d0
 notify_change+0x90c/0xd40
 ? chown_common+0x2ef/0x510
 chown_common+0x2ef/0x510
 ? chmod_common+0x3b0/0x3b0
 ? __lock_is_held+0xbc/0x160
 ? __sb_start_write+0x13d/0x2b0
 ? __mnt_want_write+0x19a/0x250
 do_fchownat+0x15c/0x190
 ? __ia32_sys_chmod+0x80/0x80
 ? trace_hardirqs_on_thunk+0x1a/0x1c
 __x64_sys_fchownat+0xbf/0x160
 ? lockdep_hardirqs_on+0x39a/0x5e0
 do_syscall_64+0xc8/0x580
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462589
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89
f7 48 89 d6 48 89
ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3
48 c7 c1 bc ff ff
ff f7 d8 64 89 01 48
RSP: 002b:00007fb4b2c83c58 EFLAGS: 00000246 ORIG_RAX: 0000000000000104
RAX: ffffffffffffffda RBX: 000000000072bfa0 RCX: 0000000000462589
RDX: 0000000000000000 RSI: 00000000200000c0 RDI: 0000000000000007
RBP: 0000000000000005 R08: 0000000000001000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb4b2c846bc
R13: 00000000004bc733 R14: 00000000006f5138 R15: 00000000ffffffff

Allocated by task 4185:
 kasan_kmalloc+0xa0/0xd0
 __kmalloc+0x14a/0x350
 sk_prot_alloc+0xf6/0x290
 sk_alloc+0x3d/0xc00
 af_alg_accept+0x9e/0x670
 hash_accept+0x4a3/0x650
 __sys_accept4+0x306/0x5c0
 __x64_sys_accept4+0x98/0x100
 do_syscall_64+0xc8/0x580
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 4184:
 __kasan_slab_free+0x12e/0x180
 kfree+0xeb/0x2f0
 __sk_destruct+0x4e6/0x6a0
 sk_destruct+0x48/0x70
 __sk_free+0xa9/0x270
 sk_free+0x2a/0x30
 af_alg_release+0x5c/0x70
 __sock_release+0xd3/0x280
 sock_close+0x1a/0x20
 __fput+0x27f/0x7f0
 task_work_run+0x136/0x1b0
 exit_to_usermode_loop+0x1a7/0x1d0
 do_syscall_64+0x461/0x580
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Syzkaller reproducer:
r0 = perf_event_open(&(0x7f0000000000)={0x0, 0x70, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0,
0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, @perf_config_ext}, 0x0, 0x0,
0xffffffffffffffff, 0x0)
r1 = socket$alg(0x26, 0x5, 0x0)
getrusage(0x0, 0x0)
bind(r1, &(0x7f00000001c0)=@alg={0x26, 'hash\x00', 0x0, 0x0,
'sha256-ssse3\x00'}, 0x80)
r2 = accept(r1, 0x0, 0x0)
r3 = accept4$unix(r2, 0x0, 0x0, 0x0)
r4 = dup3(r3, r0, 0x0)
fchownat(r4, &(0x7f00000000c0)='\x00', 0x0, 0x0, 0x1000)

Fixes: ba70670b00b1 ("socket: close race condition between sock_close() and sockfs_setattr()")
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoASoC: simple-card: fixup refcount_t underflow
Kuninori Morimoto [Fri, 15 Feb 2019 06:31:29 +0000 (15:31 +0900)]
ASoC: simple-card: fixup refcount_t underflow

commit f33caa3748601 ("ASoC: simple-card: merge simple-scu-card")
merged simple-card and simple-scu-card. Then it had refcount
underflow bug. This patch fixup it.
We will get below error without this patch.

OF: ERROR: Bad of_node_put() on /sound
CPU: 3 PID: 237 Comm: kworker/3:1 Not tainted 5.0.0-rc6+ #1514
Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
Workqueue: events deferred_probe_work_func
Call trace:
 dump_backtrace+0x0/0x150
 show_stack+0x24/0x30
 dump_stack+0xb0/0xec
 of_node_release+0xd0/0xd8
 kobject_put+0x74/0xe8
 of_node_put+0x24/0x30
 __of_get_next_child+0x50/0x70
 of_get_next_child+0x40/0x68
 asoc_simple_card_probe+0x604/0x730
 platform_drv_probe+0x58/0xa8
 ...
Reported-by: Vicente Bergas <vicencb@gmail.com>
Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
5 years agoASoC: topology: free created components in tplg load error
Bard liao [Sun, 17 Feb 2019 13:23:47 +0000 (21:23 +0800)]
ASoC: topology: free created components in tplg load error

Topology resources are no longer needed if any element failed to load.

Signed-off-by: Bard liao <yung-chuan.liao@linux.intel.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
5 years agoMerge tag 'mailbox-fixes-v5.0-rc7' of git://git.linaro.org/landing-teams/working...
Linus Torvalds [Mon, 18 Feb 2019 18:03:19 +0000 (10:03 -0800)]
Merge tag 'mailbox-fixes-v5.0-rc7' of git://git.linaro.org/landing-teams/working/fujitsu/integration

Pull mailbox fixes from Jassi Brar:

 - API: Fix build breakge by exporting the function mbox_flush

 - BRCM: Fix FlexRM ring flush timeout issue

* tag 'mailbox-fixes-v5.0-rc7' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
  mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
  mailbox: Export mbox_flush()

5 years agoMerge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm
Linus Torvalds [Mon, 18 Feb 2019 17:59:28 +0000 (09:59 -0800)]
Merge tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm

Pull ARM fixes from Russell King:
 "A few ARM fixes:

   - Dietmar Eggemann noticed an issue with IRQ migration during CPU
     hotplug stress testing.

   - Mathieu Desnoyers noticed that a previous fix broke optimised
     kprobes.

   - Robin Murphy noticed a case where we were not clearing the dma_ops"

* tag 'for-linus' of git://git.armlinux.org.uk/~rmk/linux-arm:
  ARM: 8835/1: dma-mapping: Clear DMA ops on teardown
  ARM: 8834/1: Fix: kprobes: optimized kprobes illegal instruction
  ARM: 8824/1: fix a migrating irq bug when hotplug cpu

5 years agoMerge tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt...
Linus Torvalds [Mon, 18 Feb 2019 17:40:16 +0000 (09:40 -0800)]
Merge tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Two more tracing fixes

   - Have kprobes not use copy_from_user() to access kernel addresses,
     because kprobes can legitimately poke at bad kernel memory, which
     will fault. Copy from user code should never fault in kernel space.
     Using probe_mem_read() can handle kernel address space faulting.

   - Put back the entries counter in the tracing output that was
     accidentally removed"

* tag 'trace-v5.0-rc4-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix number of entries in trace header
  kprobe: Do not use uaccess functions to access kernel memory that can fault

5 years agomailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue
Rayagonda Kokatanur [Mon, 4 Feb 2019 19:21:29 +0000 (11:21 -0800)]
mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush timeout issue

RING_CONTROL reg was not written due to wrong address, hence all
the subsequent ring flush was timing out.

Fixes: da246fd5537d ("mailbox: bcm-flexrm-mailbox: Fix FlexRM ring flush sequence")
Signed-off-by: Rayagonda Kokatanur <rayagonda.kokatanur@broadcom.com>
Signed-off-by: Ray Jui <ray.jui@broadcom.com>
Reviewed-by: Scott Branden <scott.branden@broadcom.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
5 years agomailbox: Export mbox_flush()
Thierry Reding [Mon, 4 Feb 2019 14:07:06 +0000 (15:07 +0100)]
mailbox: Export mbox_flush()

The mbox_flush() function can be used by drivers that are built as
modules, so the function needs to be exported.

Reported-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Thierry Reding <treding@nvidia.com>
Signed-off-by: Jassi Brar <jaswinder.singh@linaro.org>
5 years agoLinux 5.0-rc7
Linus Torvalds [Mon, 18 Feb 2019 02:46:40 +0000 (18:46 -0800)]
Linux 5.0-rc7

5 years agoMerge branch 'netdev-page_frag_alloc-fixes'
David S. Miller [Sun, 17 Feb 2019 23:48:43 +0000 (15:48 -0800)]
Merge branch 'netdev-page_frag_alloc-fixes'

Alexander Duyck says:

====================
Address recent issues found in netdev page_frag_alloc usage

This patch set addresses a couple of issues that I had pointed out to Jann
Horn in response to a recent patch submission.

The first issue is that I wanted to avoid the need to read/modify/write the
size value in order to generate the value for pagecnt_bias. Instead we can
just use a fixed constant which reduces the need for memory read operations
and the overall number of instructions to update the pagecnt bias values.

The other, and more important issue is, that apparently we were letting tun
access the napi_alloc_cache indirectly through netdev_alloc_frag and as a
result letting it create unaligned accesses via unaligned allocations. In
order to prevent this I have added a call to SKB_DATA_ALIGN for the fragsz
field so that we will keep the offset in the napi_alloc_cache
SMP_CACHE_BYTES aligned.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: Do not allocate page fragments that are not skb aligned
Alexander Duyck [Fri, 15 Feb 2019 22:44:18 +0000 (14:44 -0800)]
net: Do not allocate page fragments that are not skb aligned

This patch addresses the fact that there are drivers, specifically tun,
that will call into the network page fragment allocators with buffer sizes
that are not cache aligned. Doing this could result in data alignment
and DMA performance issues as these fragment pools are also shared with the
skb allocator and any other devices that will use napi_alloc_frags or
netdev_alloc_frags.

Fixes: 23f9324580f6 ("net: Split netdev_alloc_frag into __alloc_page_frag and add __napi_alloc_frag")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agomm: Use fixed constant in page_frag_alloc instead of size + 1
Alexander Duyck [Fri, 15 Feb 2019 22:44:12 +0000 (14:44 -0800)]
mm: Use fixed constant in page_frag_alloc instead of size + 1

This patch replaces the size + 1 value introduced with the recent fix for 1
byte allocs with a constant value.

The idea here is to reduce code overhead as the previous logic would have
to read size into a register, then increment it, and write it back to
whatever field was being used. By using a constant we can avoid those
memory reads and arithmetic operations in favor of just encoding the
maximum value into the operation itself.

Fixes: 2dc7caad43e2 ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoMerge branch 'tcp-fix-possible-crash-in-tcp_v4_err'
David S. Miller [Sun, 17 Feb 2019 23:46:59 +0000 (15:46 -0800)]
Merge branch 'tcp-fix-possible-crash-in-tcp_v4_err'

Eric Dumazet says:

====================
tcp: fix possible crash in tcp_v4_err()

soukjin bae reported a crash in tcp_v4_err() that we
root caused to a missing initialization.

Second patch adds a sanity check in tcp_v4_err() to avoid
future potential problems. Ignoring an ICMP message
is probably better than crashing a machine.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: tcp_v4_err() should be more careful
Eric Dumazet [Fri, 15 Feb 2019 21:36:21 +0000 (13:36 -0800)]
tcp: tcp_v4_err() should be more careful

ICMP handlers are not very often stressed, we should
make them more resilient to bugs that might surface in
the future.

If there is no packet in retransmit queue, we should
avoid a NULL deref.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agotcp: clear icsk_backoff in tcp_write_queue_purge()
Eric Dumazet [Fri, 15 Feb 2019 21:36:20 +0000 (13:36 -0800)]
tcp: clear icsk_backoff in tcp_write_queue_purge()

soukjin bae reported a crash in tcp_v4_err() handling
ICMP_DEST_UNREACH after tcp_write_queue_head(sk)
returned a NULL pointer.

Current logic should have prevented this :

  if (seq != tp->snd_una  || !icsk->icsk_retransmits ||
      !icsk->icsk_backoff || fastopen)
      break;

Problem is the write queue might have been purged
and icsk_backoff has not been cleared.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: soukjin bae <soukjin.bae@samsung.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()
Alexey Khoroshilov [Fri, 15 Feb 2019 21:20:54 +0000 (00:20 +0300)]
net: mv643xx_eth: disable clk on error path in mv643xx_eth_shared_probe()

If mv643xx_eth_shared_of_probe() fails, mv643xx_eth_shared_probe()
leaves clk enabled.

Found by Linux Driver Verification project (linuxtesting.org).

Signed-off-by: Alexey Khoroshilov <khoroshilov@ispras.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agoqmi_wwan: apply SET_DTR quirk to Sierra WP7607
Beniamino Galvani [Fri, 15 Feb 2019 12:20:42 +0000 (13:20 +0100)]
qmi_wwan: apply SET_DTR quirk to Sierra WP7607

The 1199:68C0 USB ID is reused by Sierra WP7607 which requires the DTR
quirk to be detected. Apply QMI_QUIRK_SET_DTR unconditionally as
already done for other IDs shared between different devices.

Signed-off-by: Beniamino Galvani <bgalvani@redhat.com>
Acked-by: Bjørn Mork <bjorn@mork.no>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agonet: stmmac: handle endianness in dwmac4_get_timestamp
Alexandre Torgue [Fri, 15 Feb 2019 09:49:09 +0000 (10:49 +0100)]
net: stmmac: handle endianness in dwmac4_get_timestamp

GMAC IP is little-endian and used on several kind of CPU (big or little
endian). Main callbacks functions of the stmmac drivers take care about
it. It was not the case for dwmac4_get_timestamp function.

Fixes: 90e43448887a ("stmmac: fix PTP support for GMAC4")
Signed-off-by: Alexandre Torgue <alexandre.torgue@st.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agodoc: Mention MSG_ZEROCOPY implementation for UDP
Petr Vorel [Thu, 14 Feb 2019 23:43:27 +0000 (00:43 +0100)]
doc: Mention MSG_ZEROCOPY implementation for UDP

MSG_ZEROCOPY implementation for UDP was merged in v5.0,
b3685b18af37 ("Merge branch 'udp-msg_zerocopy'").

Signed-off-by: Petr Vorel <pvorel@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
5 years agopinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins
Martin Blumenstingl [Sat, 9 Feb 2019 01:01:01 +0000 (02:01 +0100)]
pinctrl: meson: meson8b: fix the sdxc_a data 1..3 pins

Fix the mismatch between the "sdxc_d13_1_a" pin group definition from
meson8b_cbus_groups and the entry in sdxc_a_groups ("sdxc_d0_13_1_a").
This makes it possible to use "sdxc_d13_1_a" in device-tree files to
route the MMC data 1..3 pins to GPIOX_1..3.

Fixes: f1567799547358 ("pinctrl: Add support for Meson8b")
Signed-off-by: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>