Florian Westphal [Mon, 20 May 2019 11:48:10 +0000 (13:48 +0200)]
netfilter: nat: fix udp checksum corruption
Due to copy&paste error nf_nat_mangle_udp_packet passes IPPROTO_TCP,
resulting in incorrect udp checksum when payload had to be mangled.
Fixes: 5305f6e7369e8 ("netfilter: nat: remove csum_recalc hook") Reported-by: Marc Haber <mh+netdev@zugschlus.de> Tested-by: Marc Haber <mh+netdev@zugschlus.de> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Jeffrin Jose T [Wed, 15 May 2019 06:44:04 +0000 (12:14 +0530)]
selftests: netfilter: missing error check when setting up veth interface
A test for the basic NAT functionality uses ip command which needs veth
device. There is a condition where the kernel support for veth is not
compiled into the kernel and the test script breaks. This patch contains
code for reasonable error display and correct code exit.
Signed-off-by: Jeffrin Jose T <jeffrin@rajagiritech.edu.in> Acked-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
while unregistering ipvs module, ops_free_list calls
__ip_vs_cleanup, then nf_unregister_net_hooks be called to
do remove nf hook entries. It need a RCU period to finish,
however net->ipvs is set to NULL immediately, which will
trigger NULL pointer dereference when a packet is hooked
and handled by ip_vs_in where net->ipvs is dereferenced.
Another scene is ops_free_list call ops_free to free the
net_generic directly while __ip_vs_cleanup finished, then
calling ip_vs_in will triggers use-after-free.
This patch moves nf_unregister_net_hooks from __ip_vs_cleanup()
to __ip_vs_dev_cleanup(), where rcu_barrier() is called by
unregister_pernet_device -> unregister_pernet_operations,
that will do the needed grace period.
Reported-by: Hulk Robot <hulkci@huawei.com> Fixes: ab496ddea47f ("ipvs: convert to use pernet nf_hook api") Suggested-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Phil Sutter [Wed, 15 May 2019 18:15:32 +0000 (20:15 +0200)]
netfilter: nft_fib: Fix existence check support
NFTA_FIB_F_PRESENT flag was not always honored since eval functions did
not call nft_fib_store_result in all cases.
Given that in all callsites there is a struct net_device pointer
available which holds the interface data to be stored in destination
register, simplify nft_fib_store_result() to just accept that pointer
instead of the nft_pktinfo pointer and interface index. This also
allows to drop the index to interface lookup previously needed to get
the name associated with given index.
Fixes: 9e268a8ad2735 ("netfilter: nft_fib: Support existence check") Signed-off-by: Phil Sutter <phil@nwl.cc> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
This patch fixes netfilter hook traversal when there are more than 1 hooks
returning NF_QUEUE verdict. When the first queue reinjects the packet,
'nf_reinject' starts traversing hooks with a proper hook_index. However,
if it again receives a NF_QUEUE verdict (by some other netfilter hook), it
queues the packet with a wrong hook_index. So, when the second queue
reinjects the packet, it re-executes hooks in between.
Fixes: 9f62bf5133c1 ("netfilter: convert hook list to an array") Signed-off-by: Jagdish Motwani <jagdish.motwani@sophos.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Its not possible to fetch previous element in rcu-protected lists
when deletions are not prevented somehow: list_del_rcu poisons
the ->prev pointer value.
Before rcu-conversion this was safe as dump operations did hold
nfnetlink mutex.
Pass previous rule as argument, obtained by keeping a pointer to
the previous rule during traversal.
Fixes: ed33e5f4c35c59 ("netfilter: nf_tables: use call_rcu in netlink dumps") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
David S. Miller [Sat, 18 May 2019 20:13:40 +0000 (13:13 -0700)]
Merge branch 'mlxsw-Two-port-module-fixes'
Ido Schimmel says:
====================
mlxsw: Two port module fixes
Patch #1 fixes driver initialization failure on old ASICs due to
unsupported register access. This is fixed by first testing if the
register is supported.
Patch #2 fixes reading of certain modules' EEPROM. The problem and
solution are explained in detail in the commit message.
Please consider both patches for stable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Pasternak [Sat, 18 May 2019 15:58:29 +0000 (18:58 +0300)]
mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
Prevent reading unsupported slave address from SFP EEPROM by testing
Diagnostic Monitoring Type byte in EEPROM. Read only page zero of
EEPROM, in case this byte is zero.
If some SFP transceiver does not support Digital Optical Monitoring
(DOM), reading SFP EEPROM slave address 0x51 could return an error.
Availability of DOM support is verified by reading from zero page
Diagnostic Monitoring Type byte describing how diagnostic monitoring is
implemented by transceiver. If bit 6 of this byte is set, it indicates
that digital diagnostic monitoring has been implemented. Otherwise it is
not and transceiver could fail to reply to transaction for slave address
0x51 [1010001X (A2h)], which is used to access measurements page.
Such issue has been observed when reading cable MCP2M00-xxxx,
MCP7F00-xxxx, and few others.
Fixes: d33473ecd335 ("mlxsw: spectrum: Add support for access cable info via ethtool") Fixes: 6aafcf5121dd ("mlxsw: spectrum: Fix EEPROM access in case of SFP/SFP+") Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Vadim Pasternak [Sat, 18 May 2019 15:58:28 +0000 (18:58 +0300)]
mlxsw: core: Prevent QSFP module initialization for old hardware
Old Mellanox silicons, like switchx-2, switch-ib do not support reading
QSFP modules temperature through MTMP register. Attempt to access this
register on systems equipped with the this kind of silicon will cause
initialization flow failure.
Test for hardware resource capability is added in order to distinct
between old and new silicon - old silicons do not have such capability.
Fixes: f01f5bdbdb6d ("mlxsw: core: Extend thermal module with per QSFP module thermal zones") Fixes: ca0e6eaa2181 ("mlxsw: core: Extend hwmon interface with QSFP module temperature attributes") Reported-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Vadim Pasternak <vadimp@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eli Britstein [Sun, 12 May 2019 11:50:58 +0000 (11:50 +0000)]
net/mlx5e: Fix possible modify header actions memory leak
The cited commit could disable the modify header flag, but did not free
the allocated memory for the modify header actions. Fix it.
Fixes: 61c706f0b0e00 ("net/mlx5e: Do not rewrite fields with the same match") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eli Britstein [Wed, 10 Apr 2019 19:42:20 +0000 (19:42 +0000)]
net/mlx5e: Fix no rewrite fields with the same match
With commit 61c706f0b0e0 ("net/mlx5e: Do not rewrite fields with the
same match") there are no rewrites if the rewrite value is the same as
the matched value. However, if the field is not matched, the rewrite is
also wrongly skipped. Fix it.
Fixes: 61c706f0b0e0 ("net/mlx5e: Do not rewrite fields with the same match") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Dmytro Linkin [Thu, 2 May 2019 15:21:38 +0000 (15:21 +0000)]
net/mlx5e: Additional check for flow destination comparison
Flow destination comparison has an inaccuracy: code see no
difference between same vf ports, which belong to different pfs.
Example: If start ping from VF0 (PF1) to VF1 (PF1) and mirror
all traffic to VF0 (PF2), icmp reply to VF0 (PF1) and mirrored
flow to VF0 (PF2) would be determined as same destination. It lead
to creating flow handler with rule nodes, which not added to node
tree. When later driver try to delete this flow rules we got
kernel crash.
Add comparison of vhca_id field to avoid this.
Fixes: 495ec8e8d34f ("net/mlx5: Consider encapsulation properties when comparing destinations") Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Dmytro Linkin [Thu, 25 Apr 2019 08:52:02 +0000 (08:52 +0000)]
net/mlx5e: Add missing ethtool driver info for representors
For all representors added firmware version info to show in
ethtool driver info.
For uplink representor, because only it is tied to the pci device
sysfs, added pci bus info.
Fixes: bf2edca555e3 ("net/mlx5e: Add some ethtool port control entries to the uplink rep netdev") Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com> Reviewed-by: Gavi Teitz <gavi@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Eli Britstein [Mon, 13 May 2019 09:06:02 +0000 (09:06 +0000)]
net/mlx5e: Fix number of vports for ingress ACL configuration
With the cited commit, ACLs are configured for the VF ports. The loop
for the number of ports had the wrong number. Fix it.
Fixes: fadf869da879 ("net/mlx5e: ACLs for priority tag mode") Signed-off-by: Eli Britstein <elibr@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed [Tue, 7 May 2019 19:59:38 +0000 (12:59 -0700)]
net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
ethtool user spaces needs to know ring count via ETHTOOL_GRXRINGS when
executing (ethtool -x) which is retrieved via ethtool get_rxnfc callback,
in mlx5 this callback is disabled when CONFIG_MLX5_EN_RXNFC=n.
This patch allows only ETHTOOL_GRXRINGS command on mlx5e_get_rxnfc() when
CONFIG_MLX5_EN_RXNFC is disabled, so ethtool -x will continue working.
net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
To avoid any ambiguity between vport index and vport number,
rename functions that had vport, to vport_num or vport_index appropriately.
vport_num is u16 hence change mlx5_eswitch_index_to_vport_num() return
type to u16.
vport_index is an int in vport array. Hence change input type of vport
index in mlx5_eswitch_index_to_vport_num() to int.
Correct multiple eswitch representor interfaces use type u16 of
rep->vport as type int vport_index.
Send vport FW commands with correct eswitch u16 vport_num instead
host int vport_index.
Fixes: fcc634ba25fa ("net/mlx5: E-Switch, Assign a different position for uplink rep and vport") Signed-off-by: Parav Pandit <parav@mellanox.com> Signed-off-by: Vu Pham <vuhuong@mellanox.com> Reviewed-by: Bodong Wang <bodong@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
net/mlx5: Add meaningful return codes to status_to_err function
Current version of function status_to_err return -1 for any
status returned by mlx5_cmd_invoke function. In case status is
MLX5_DRIVER_STATUS_ABORTED we should return 0 to the caller as we
assume command completed successfully on FW. If error returned we are
getting confusing messages in dmesg. In addition, currently returned
value -1 is confusing with -EPERM.
New implementation actually fix original commit and return meaningful
codes for commands delivery status and print message in case of failure.
Junwei Hu [Fri, 17 May 2019 11:27:34 +0000 (19:27 +0800)]
tipc: fix modprobe tipc failed after switch order of device registration
Error message printed:
modprobe: ERROR: could not insert 'tipc': Address family not
supported by protocol.
when modprobe tipc after the following patch: switch order of
device registration, commit 4e178967e483
("tipc: switch order of device registration to fix a crash")
Because sock_create_kern(net, AF_TIPC, ...) is called by
tipc_topsrv_create_listener() in the initialization process
of tipc_net_ops, tipc_socket_init() must be execute before that.
I move tipc_socket_init() into function tipc_init_net().
Fixes: 4e178967e483
("tipc: switch order of device registration to fix a crash") Signed-off-by: Junwei Hu <hujunwei4@huawei.com> Reported-by: Wang Wang <wangwang2@huawei.com> Reviewed-by: Kang Zhou <zhoukang7@huawei.com> Reviewed-by: Suanming Mou <mousuanming@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Variable 'entropy' was wrongly documented as 'seed', changed comment to
reflect actual variable name.
../lib/random32.c:179: warning: Function parameter or member 'entropy' not described in 'prandom_seed'
../lib/random32.c:179: warning: Excess function parameter 'seed' description in 'prandom_seed'
Signed-off-by: Philippe Mazenauer <philippe.mazenauer@outlook.de> Acked-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Wei Wang [Thu, 16 May 2019 20:30:54 +0000 (13:30 -0700)]
ipv6: fix src addr routing with the exception table
When inserting route cache into the exception table, the key is
generated with both src_addr and dest_addr with src addr routing.
However, current logic always assumes the src_addr used to generate the
key is a /128 host address. This is not true in the following scenarios:
1. When the route is a gateway route or does not have next hop.
(rt6_is_gw_or_nonexthop() == false)
2. When calling ip6_rt_cache_alloc(), saddr is passed in as NULL.
This means, when looking for a route cache in the exception table, we
have to do the lookup twice: first time with the passed in /128 host
address, second time with the src_addr stored in fib6_info.
This solves the pmtu discovery issue reported by Mikael Magnusson where
a route cache with a lower mtu info is created for a gateway route with
src addr. However, the lookup code is not able to find this route cache.
Fixes: 61476ec10fbd ("ipv6: hook up exception table to store dst cache") Reported-by: Mikael Magnusson <mikael.kernel@lists.m7n.se> Bisected-by: David Ahern <dsahern@gmail.com> Signed-off-by: Wei Wang <weiwan@google.com> Cc: Martin Lau <kafai@fb.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David Ahern [Thu, 16 May 2019 17:41:31 +0000 (10:41 -0700)]
selftests: pmtu.sh: Remove quotes around commands in setup_xfrm
The first command in setup_xfrm is failing resulting in the test getting
skipped:
+ ip netns exec ns-B ip -6 xfrm state add src fd00:1::a dst fd00:1::b spi 0x1000 proto esp aead 'rfc4106(gcm(aes))' 0x0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f0f 128 mode tunnel
+ out=RTNETLINK answers: Function not implemented
...
xfrm6 not supported
TEST: vti6: PMTU exceptions [SKIP]
xfrm4 not supported
TEST: vti4: PMTU exceptions [SKIP]
...
The setup command started failing when the run_cmd option was added.
Removing the quotes fixes the problem:
...
TEST: vti6: PMTU exceptions [ OK ]
TEST: vti4: PMTU exceptions [ OK ]
...
Fixes: b2e77b56ff7d ("selftests: Add debugging options to pmtu.sh") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 16 May 2019 15:09:57 +0000 (08:09 -0700)]
net: avoid weird emergency message
When host is under high stress, it is very possible thread
running netdev_wait_allrefs() returns from msleep(250)
10 seconds late.
This leads to these messages in the syslog :
[...] unregister_netdevice: waiting for syz_tun to become free. Usage count = 0
If the device refcount is zero, the wait is over.
Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
First and second original patches are already dropped from stable,
No need to stable-queue the third patch as it has no functional impact,
just a logic cleanup.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Igor Russkikh [Thu, 16 May 2019 14:52:25 +0000 (14:52 +0000)]
aqc111: cleanup mtu related logic
Original fix d7965a7dedca was done under impression that invalid data
could be written for mtu configuration higher that 16334.
But the high limit will anyway be rejected my max_mtu check in caller.
Thus, make the code cleaner and allow it doing the configuration without
checking for maximum mtu value.
Fixes: d7965a7dedca ("aqc111: fix endianness issue in aqc111_change_mtu") Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Westphal [Thu, 16 May 2019 09:28:16 +0000 (11:28 +0200)]
xfrm: ressurrect "Fix uninitialized memory read in _decode_session4"
This resurrects commit ebffdb9e99a2e9e2b
("xfrm4: Fix uninitialized memory read in _decode_session4"),
which got lost during a merge conflict resolution between ipsec-next
and net-next tree.
1c288a48e097 ("xfrm: remove decode_session indirection from afinfo_policy")
in ipsec-next moved the (buggy) _decode_session4 from
net/ipv4/xfrm4_policy.c to net/xfrm/xfrm_policy.c.
In mean time, ebffdb9e99a2e was applied to ipsec.git and fixed the
problem in the "old" location.
When the trees got merged, the moved, old function was kept.
This applies the "lost" commit again, to the new location.
Fixes: 939a2be1d864c ("Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next") Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
Andrii Nakryiko [Thu, 16 May 2019 03:39:27 +0000 (20:39 -0700)]
libbpf: move logging helpers into libbpf_internal.h
libbpf_util.h header was recently exposed as public as a dependency of
xsk.h. In addition to memory barriers, it contained logging helpers,
which are not supposed to be exposed. This patch moves those into
libbpf_internal.h, which is kept as an internal header.
Cc: Stanislav Fomichev <sdf@google.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Fixes: 68345e4ef9e4 ("libbpf: add libbpf_util.h to header install.") Signed-off-by: Andrii Nakryiko <andriin@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Junwei Hu [Thu, 16 May 2019 02:51:15 +0000 (10:51 +0800)]
tipc: switch order of device registration to fix a crash
When tipc is loaded while many processes try to create a TIPC socket,
a crash occurs:
PANIC: Unable to handle kernel paging request at virtual
address "dfff20000000021d"
pc : tipc_sk_create+0x374/0x1180 [tipc]
lr : tipc_sk_create+0x374/0x1180 [tipc]
Exception class = DABT (current EL), IL = 32 bits
Call trace:
tipc_sk_create+0x374/0x1180 [tipc]
__sock_create+0x1cc/0x408
__sys_socket+0xec/0x1f0
__arm64_sys_socket+0x74/0xa8
...
This is due to race between sock_create and unfinished
register_pernet_device. tipc_sk_insert tries to do
"net_generic(net, tipc_net_id)".
but tipc_net_id is not initialized yet.
So switch the order of the two to close the race.
This can be reproduced with multiple processes doing socket(AF_TIPC, ...)
and one process doing module removal.
Fixes: 93e14f2daba5 ("tipc: make subscriber server support net namespace") Signed-off-by: Junwei Hu <hujunwei4@huawei.com> Reported-by: Wang Wang <wangwang2@huawei.com> Reviewed-by: Xiaogang Wang <wangxiaogang3@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 16 May 2019 02:39:52 +0000 (19:39 -0700)]
ipv6: prevent possible fib6 leaks
At ipv6 route dismantle, fib6_drop_pcpu_from() is responsible
for finding all percpu routes and set their ->from pointer
to NULL, so that fib6_ref can reach its expected value (1).
The problem right now is that other cpus can still catch the
route being deleted, since there is no rcu grace period
between the route deletion and call to fib6_drop_pcpu_from()
This can leak the fib6 and associated resources, since no
notifier will take care of removing the last reference(s).
I decided to add another boolean (fib6_destroying) instead
of reusing/renaming exception_bucket_flushed to ease stable backports,
and properly document the memory barriers used to implement this fix.
This patch has been co-developped with Wei Wang.
Fixes: 87bd20a7a731 ("net/ipv6: separate handling of FIB entries from dst based routes") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Wei Wang <weiwan@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: Martin Lau <kafai@fb.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Willem de Bruijn [Wed, 15 May 2019 17:29:16 +0000 (13:29 -0400)]
net: test nouarg before dereferencing zerocopy pointers
Zerocopy skbs without completion notification were added for packet
sockets with PACKET_TX_RING user buffers. Those signal completion
through the TP_STATUS_USER bit in the ring. Zerocopy annotation was
added only to avoid premature notification after clone or orphan, by
triggering a copy on these paths for these packets.
The mechanism had to define a special "no-uarg" mode because packet
sockets already use skb_uarg(skb) == skb_shinfo(skb)->destructor_arg
for a different pointer.
Before deferencing skb_uarg(skb), verify that it is a real pointer.
Fixes: 6fc822ab0cbc7 ("packet: copy user buffers before orphan or clone") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: phy: aquantia: readd XGMII support for AQR107
XGMII interface mode no longer works on AQR107 after the recent changes,
adding back support.
Fixes: 72b205c6117e ("net: phy: aquantia: check for supported interface modes in config_init") Signed-off-by: Madalin Bucur <madalin.bucur@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bpfilter: fallback to netfilter if failed to load bpfilter kernel module
If bpfilter is not available return ENOPROTOOPT to fallback to netfilter.
Function request_module() returns both errors and userspace exit codes.
Just ignore them. Rechecking bpfilter_ops is enough.
Fixes: 442032511c36 ("net: add skeleton of bpfilter kernel module") Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
Sunil Muthuswamy [Wed, 15 May 2019 00:56:05 +0000 (00:56 +0000)]
hv_sock: Add support for delayed close
Currently, hvsock does not implement any delayed or background close
logic. Whenever the hvsock socket is closed, a FIN is sent to the peer, and
the last reference to the socket is dropped, which leads to a call to
.destruct where the socket can hang indefinitely waiting for the peer to
close it's side. The can cause the user application to hang in the close()
call.
This change implements proper STREAM(TCP) closing handshake mechanism by
sending the FIN to the peer and the waiting for the peer's FIN to arrive
for a given timeout. On timeout, it will try to terminate the connection
(i.e. a RST). This is in-line with other socket providers such as virtio.
This change does not address the hang in the vmbus_hvsock_device_unregister
where it waits indefinitely for the host to rescind the channel. That
should be taken up as a separate fix.
Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 16 May 2019 19:02:42 +0000 (12:02 -0700)]
Merge branch 'flow_offload-fix-CVLAN-support'
Edward Cree says:
====================
flow_offload: fix CVLAN support
When the flow_offload infrastructure was added, CVLAN matches weren't
plumbed through, and flow_rule_match_vlan() was incorrectly called in
the mlx5 driver when populating CVLAN match information. This series
adds flow_rule_match_cvlan(), and uses it in the mlx5 code.
Both patches should also go to 5.1 stable.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Jianbo Liu [Tue, 14 May 2019 20:18:50 +0000 (21:18 +0100)]
net/mlx5e: Fix calling wrong function to get inner vlan key and mask
When flow_rule_match_XYZ() functions were first introduced,
flow_rule_match_cvlan() for inner vlan is missing.
In mlx5_core driver, to get inner vlan key and mask, flow_rule_match_vlan()
is just called, which is wrong because it obtains outer vlan information by
FLOW_DISSECTOR_KEY_VLAN.
This commit fixes this by changing to call flow_rule_match_cvlan() after
it's added.
Fixes: d3c911931720 ("flow_offload: add flow_rule and flow_match structures and use them") Signed-off-by: Jianbo Liu <jianbol@mellanox.com> Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yonghong Song [Thu, 16 May 2019 17:17:31 +0000 (10:17 -0700)]
tools/bpftool: move set_max_rlimit() before __bpf_object__open_xattr()
For a host which has a lower rlimit for max locked memory (e.g., 64KB),
the following error occurs in one of our production systems:
# /usr/sbin/bpftool prog load /paragon/pods/52877437/home/mark.o \
/sys/fs/bpf/paragon_mark_21 type cgroup/skb \
map idx 0 pinned /sys/fs/bpf/paragon_map_21
libbpf: Error in bpf_object__probe_name():Operation not permitted(1).
Couldn't load basic 'r0 = 0' BPF program.
Error: failed to open object file
The reason is due to low locked memory during bpf_object__probe_name()
which probes whether program name is supported in kernel or not
during __bpf_object__open_xattr().
bpftool program load already tries to relax mlock rlimit before
bpf_object__load(). Let us move set_max_rlimit() before
__bpf_object__open_xattr(), which fixed the issue here.
Fixes: a857a142a585 ("bpf, libbpf: introduce bpf_object__probe_caps to test BPF capabilities") Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Chenbo Feng [Wed, 15 May 2019 02:42:57 +0000 (19:42 -0700)]
bpf: relax inode permission check for retrieving bpf program
For iptable module to load a bpf program from a pinned location, it
only retrieve a loaded program and cannot change the program content so
requiring a write permission for it might not be necessary.
Also when adding or removing an unrelated iptable rule, it might need to
flush and reload the xt_bpf related rules as well and triggers the inode
permission check. It might be better to remove the write premission
check for the inode so we won't need to grant write access to all the
processes that flush and restore iptables rules.
Herbert Xu [Thu, 16 May 2019 07:19:48 +0000 (15:19 +0800)]
rhashtable: Fix cmpxchg RCU warnings
As cmpxchg is a non-RCU mechanism it will cause sparse warnings
when we use it for RCU. This patch adds explicit casts to silence
those warnings. This should probably be moved to RCU itself in
future.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
Herbert Xu [Thu, 16 May 2019 07:19:46 +0000 (15:19 +0800)]
rhashtable: Remove RCU marking from rhash_lock_head
The opaque type rhash_lock_head should not be marked with __rcu
because it can never be dereferenced. We should apply the RCU
marking when we turn it into a pointer which can be dereferenced.
This patch does exactly that. This fixes a number of sparse
warnings as well as getting rid of some unnecessary RCU checking.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
The following pull-request contains BPF updates for your *net* tree.
The main changes are:
1) Fix a use after free in __dev_map_entry_free(), from Eric.
2) Several sockmap related bug fixes: a splat in strparser if
it was never initialized, remove duplicate ingress msg list
purging which can race, fix msg->sg.size accounting upon
skb to msg conversion, and last but not least fix a timeout
bug in tcp_bpf_wait_data(), from John.
3) Fix LRU map to avoid messing with eviction heuristics upon
syscall lookup, e.g. map walks from user space side will
then lead to eviction of just recently created entries on
updates as it would mark all map entries, from Daniel.
4) Don't bail out when libbpf feature probing fails. Also
various smaller fixes to flow_dissector test, from Stanislav.
5) Fix missing brackets for BTF_INT_OFFSET() in UAPI, from Gary.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
selftests/bpf: add prog detach to flow_dissector test
In case we are not running in a namespace (which we don't do by default),
let's try to detach the bpf program that we use for eth_get_headlen tests.
Fixes: 389fad8d32fc ("selftests/bpf: run flow dissector tests in skb-less mode") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
selftests/bpf: add missing \n to flow_dissector CHECK errors
Otherwise, in case of an error, everything gets mushed together.
Fixes: 4749eb87850f ("selftests/bpf: make flow dissector tests more extensible") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Otherwise libbpf is unusable from unprivileged process with
kernel.kernel.unprivileged_bpf_disabled=1.
All I get is EPERM from the probes, even if I just want to
open an ELF object and look at what progs/maps it has.
Instead of dying on probes, let's just pr_debug the error and
try to continue.
Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Eric Dumazet [Wed, 15 May 2019 16:10:15 +0000 (09:10 -0700)]
tcp: do not recycle cloned skbs
It is illegal to change arbitrary fields in skb_shared_info if the
skb is cloned.
Before calling skb_zcopy_clear() we need to ensure this rule,
therefore we need to move the test from sk_stream_alloc_skb()
to sk_wmem_free_skb()
Fixes: 8d3a8110b57e ("tcp: fix zerocopy and notsent_lowat issues") Signed-off-by: Eric Dumazet <edumazet@google.com> Diagnosed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Claudiu Manoil [Wed, 15 May 2019 16:08:56 +0000 (19:08 +0300)]
enetc: Fix NULL dma address unmap for Tx BD extensions
For the unlikely case of TxBD extensions (i.e. ptp)
the driver tries to unmap the tx_swbd corresponding
to the extension, which is bogus as it has no buffer
attached.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
If ppp_deflate fails to register in deflate_init,
module initialization failed out, however
ppp_deflate_draft may has been regiestred and not
unregistered before return.
Then the seconed modprobe will trigger crash like this.
Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Luca Ceresoli [Tue, 14 May 2019 13:23:07 +0000 (15:23 +0200)]
net: macb: fix error format in dev_err()
Errors are negative numbers. Using %u shows them as very large positive
numbers such as 4294967277 that don't make sense. Use the %d format
instead, and get a much nicer -19.
Signed-off-by: Luca Ceresoli <luca@lucaceresoli.net> Fixes: 1f0379144504 ("net: macb: Migrate to devm clock interface") Fixes: 583f67901d83 ("net/macb: unify clock management") Fixes: 2b0e952a67ac ("net/macb: merge at91_ether driver into macb driver") Fixes: a53808a6608d ("net: ethernet: macb: Add support for rx_clk") Fixes: 31b03ef42f65 ("net: macb: Support clock management for tsu_clk") Acked-by: Nicolas Ferre <nicolas.ferre@microchip.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
Sabrina Dubroca [Tue, 14 May 2019 13:12:19 +0000 (15:12 +0200)]
rtnetlink: always put IFLA_LINK for links with a link-netnsid
Currently, nla_put_iflink() doesn't put the IFLA_LINK attribute when
iflink == ifindex.
In some cases, a device can be created in a different netns with the
same ifindex as its parent. That device will not dump its IFLA_LINK
attribute, which can confuse some userspace software that expects it.
For example, if the last ifindex created in init_net and foo are both
8, these commands will trigger the issue:
ip link add parent type dummy # ifindex 9
ip link add link parent netns foo type macvlan # ifindex 9 in ns foo
So, in case a device puts the IFLA_LINK_NETNSID attribute in a dump,
always put the IFLA_LINK attribute as well.
Thanks to Dan Winship for analyzing the original OpenShift bug down to
the missing netlink attribute.
v2: change Fixes tag, it's been here forever, as Nicolas Dichtel said
add Nicolas' ack
v3: change Fixes tag
fix subject typo, spotted by Edward Cree
Analyzed-by: Dan Winship <danw@redhat.com> Fixes: 80254f791b57 ("[NET]: netlink support for moving devices between network namespaces.") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Yunjian Wang [Tue, 14 May 2019 11:03:19 +0000 (19:03 +0800)]
net/mlx4_core: Change the error print to info print
The error print within mlx4_flow_steer_promisc_add() should
be a info print.
Fixes: 3ad85e40f1e4 ('net/mlx4: Implement promiscuous mode with device managed flow-steering') Signed-off-by: Yunjian Wang <wangyunjian@huawei.com> Reviewed-by: Tariq Toukan <tariqt@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Johannes Berg [Tue, 14 May 2019 09:02:31 +0000 (11:02 +0200)]
NFC: Orphan the subsystem
Samuel clearly hasn't been working on this in many years and
patches getting to the wireless list are just being ignored
entirely now. Mark the subsystem as orphan to reflect the
current state and revert back to the netdev list so at least
some fixes can be picked up by Dave.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Samuel Ortiz <sameo@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Florian Fainelli [Mon, 13 May 2019 21:06:24 +0000 (14:06 -0700)]
net: Always descend into dsa/
Jiri reported that with a kernel built with CONFIG_FIXED_PHY=y,
CONFIG_NET_DSA=m and CONFIG_NET_DSA_LOOP=m, we would not get to a
functional state where the mock-up driver is registered. Turns out that
we are not descending into drivers/net/dsa/ unconditionally, and we
won't be able to link-in dsa_loop_bdinfo.o which does the actual mock-up
mdio device registration.
Reported-by: Jiri Pirko <jiri@resnulli.us> Fixes: 8d9f3a862d5b ("net: dsa: Fix functional dsa-loop dependency on FIXED_PHY") Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Tested-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net>
Yuchung Cheng [Mon, 13 May 2019 17:32:05 +0000 (10:32 -0700)]
tcp: fix retrans timestamp on passive Fast Open
Commit c209eb9dbd68 ("tcp: properly track retry time on
passive Fast Open") sets the start of SYNACK retransmission
time on passive Fast Open in "retrans_stamp". However the
timestamp is not reset upon the handshake has completed. As a
result, future data packet retransmission may not update it in
tcp_retransmit_skb(). This may lead to socket aborting earlier
unexpectedly by retransmits_timed_out() since retrans_stamp remains
the SYNACK rtx time.
This bug only manifests on passive TFO sender that a) suffered
SYNACK timeout and then b) stalls on very first loss recovery. Any
successful loss recovery would reset the timestamp to avoid this
issue.
Fixes: c209eb9dbd68 ("tcp: properly track retry time on passive Fast Open") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Tue, 14 May 2019 21:12:59 +0000 (14:12 -0700)]
Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost
Pull virtio updates from Michael Tsirkin:
- enable packed ring support for s390
- several fixes
* tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost:
virtio/s390: enable packed ring
virtio/s390: DMA support for virtio-ccw
virtio/s390: use vring_create_virtqueue
virtio/virtio_ring: do some comment fixes
vhost-scsi: remove incorrect memory barrier
tools/virtio/ringtest: Remove bogus definition of BUG_ON()
virtio_ring: Fix potential mem leak in virtqueue_add_indirect_packed
Linus Torvalds [Tue, 14 May 2019 20:30:10 +0000 (13:30 -0700)]
Add gitignore file for samples/vfs/ generated files
Commit a240b7eb0ae1 ("vfs: Add a sample program for the new mount API")
added sample programs that get built during the kernel build, but then
cause 'git status' to worry about whether the resulting binaries should
be managed by git.
Tell git not to worry, and to ignore the sample binaries.
Linus Torvalds [Tue, 14 May 2019 20:17:19 +0000 (13:17 -0700)]
Merge branch 'parisc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull more parisc updates from Helge Deller:
"Two small enhancements, which I didn't included in the last pull
request because I wanted to keep them a few more days in for-next
before sending upstream:
- Replace the ldcw barrier instruction by a nop instruction in the
CAS code on uniprocessor machines.
- Map variables read-only after init (enable ro_after_init feature)"
* 'parisc-5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Use __ro_after_init in init.c
parisc: Use __ro_after_init in unwind.c
parisc: Use __ro_after_init in time.c
parisc: Use __ro_after_init in processor.c
parisc: Use __ro_after_init in process.c
parisc: Use __ro_after_init in perf_images.h
parisc: Use __ro_after_init in pci.c
parisc: Use __ro_after_init in inventory.c
parisc: Use __ro_after_init in head.S
parisc: Use __ro_after_init in firmware.c
parisc: Use __ro_after_init in drivers.c
parisc: Use __ro_after_init in cache.c
parisc: Enable the ro_after_init feature
parisc: Drop LDCW barrier in CAS code when running UP
Linus Torvalds [Tue, 14 May 2019 20:14:10 +0000 (13:14 -0700)]
Merge tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux
Pull kgdb updates from Daniel Thompson:
"Mostly cleanups but there are also a couple of fixes for out-of-bounds
accesses (including a potential write to the byte before a static
buffer).
The main changes are:
- Fixes to those out-of-bounds access (empty string to configure test
module could write the byte before a buffer, high cpu counts could
read outside of per-cpu structures).
- Improvements to string handling problems picked up by new compiler
warnings and other static checks. Most are fixing benign issues
that can't be tickled without code changes but still reduce the wtf
factor a little.
- Tidy up the terminal output"
* tag 'kgdb-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
kdb: Fix bound check compiler warning
kdb: do a sanity check on the cpu in kdb_per_cpu()
kdb: Get rid of broken attempt to print CCVERSION in kdb summary
misc: kgdbts: fix out-of-bounds access in function param_set_kgdbts_var
kdb: kdb_support: replace strcpy() by strscpy()
gdbstub: Replace strcpy() by strscpy()
gdbstub: mark expected switch fall-throughs
Linus Torvalds [Tue, 14 May 2019 17:55:54 +0000 (10:55 -0700)]
Merge tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull modules updates from Jessica Yu:
- Use a separate table to store symbol types instead of hijacking
fields in struct Elf_Sym
- Trivial code cleanups
* tag 'modules-for-v5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
module: add stubs for within_module functions
kallsyms: store type information in its own array
vmlinux.lds.h: drop unused __vermagic
====================
This set fixes LRU map eviction in combination with map lookups out
of system call side from user space. Main patch is the second one and
test cases are adapted and added in the last one. Thanks!
====================
Daniel Borkmann [Mon, 13 May 2019 23:18:57 +0000 (01:18 +0200)]
bpf: test ref bit from data path and add new tests for syscall path
The test_lru_map is relying on marking the LRU map entry via regular
BPF map lookup from system call side. This is basically for simplicity
reasons. Given we fixed marking entries in that case, the test needs
to be fixed as well. Here we add a small drop-in replacement to retain
existing behavior for the tests by marking out of the BPF program and
transferring the retrieved value out via temporary map. This also adds
new test cases to track the new behavior where two elements are marked,
one via system call side and one via program side, where the next update
then evicts the key looked up only from system call side.
Daniel Borkmann [Mon, 13 May 2019 23:18:56 +0000 (01:18 +0200)]
bpf, lru: avoid messing with eviction heuristics upon syscall lookup
One of the biggest issues we face right now with picking LRU map over
regular hash table is that a map walk out of user space, for example,
to just dump the existing entries or to remove certain ones, will
completely mess up LRU eviction heuristics and wrong entries such
as just created ones will get evicted instead. The reason for this
is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
system call lookup side as well. Thus upon walk, all entries are
being marked, so information of actual least recently used ones
are "lost".
In case of Cilium where it can be used (besides others) as a BPF
based connection tracker, this current behavior causes disruption
upon control plane changes that need to walk the map from user space
to evict certain entries. Discussion result from bpfconf [0] was that
we should simply just remove marking from system call side as no
good use case could be found where it's actually needed there.
Therefore this patch removes marking for regular LRU and per-CPU
flavor. If there ever should be a need in future, the behavior could
be selected via map creation flag, but due to mentioned reason we
avoid this here.
[0] http://vger.kernel.org/bpfconf.html
Fixes: d166b1eafbb7 ("bpf: Add BPF_MAP_TYPE_LRU_HASH") Fixes: 94dc49553dda ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Daniel Borkmann [Mon, 13 May 2019 23:18:55 +0000 (01:18 +0200)]
bpf: add map_lookup_elem_sys_only for lookups from syscall side
Add a callback map_lookup_elem_sys_only() that map implementations
could use over map_lookup_elem() from system call side in case the
map implementation needs to handle the latter differently than from
the BPF data path. If map_lookup_elem_sys_only() is set, this will
be preferred pick for map lookups out of user space. This hook is
used in a follow-up fix for LRU map, but once development window
opens, we can convert other map types from map_lookup_elem() (here,
the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
the callback to simplify and clean up the latter.
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Linus Torvalds [Tue, 14 May 2019 17:45:03 +0000 (10:45 -0700)]
Merge tag 'backlight-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight
Pull backlight updates from Lee Jones:
"Fix-ups:
- Remove unused BACKLIGHT_LCD_SUPPORT symbol
- Remove unused BACKLIGHT_CLASS_DEVICE dependencies
- Add DT support to lm3630a_bl
Bug Fixes:
- Fix error path issues in lm3630a_bl"
* tag 'backlight-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/backlight:
backlight: lm3630a: Add firmware node support
dt-bindings: backlight: Add lm3630a bindings
backlight: lm3630a: Return 0 on success in update_status functions
video: lcd: Remove useless BACKLIGHT_CLASS_DEVICE dependencies
video: backlight: Remove useless BACKLIGHT_LCD_SUPPORT kernel symbol
Linus Torvalds [Tue, 14 May 2019 17:39:08 +0000 (10:39 -0700)]
Merge tag 'mfd-next-5.2' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd
Pull MFD updates from Lee Jones:
"Core Framework:
- Document (kerneldoc) core mfd_add_devices() API
New Drivers:
- Altera SOCFPGA System Manager
- Maxim MAX77650/77651 PMIC
- Maxim MAX77663 PMIC
- ST Multi-Function eXpander (STMFX)
New Device Support:
- LEDs support in Intel Cherry Trail Whiskey Cove PMIC
- RTC support in SAMSUNG Electronics S2MPA01 PMIC
- SAM9X60 support in Atmel HLCDC (High-end LCD Controller)
- USB X-Powers AXP 8xx PMICs
- Integrated Sensor Hub (ISH) in ChromeOS EC
- USB PD Logger in ChromeOS EC
- AXP223 in X-Powers AXP series PMICs
- Power Supply in X-Powers AXP 803 PMICs
- Comet Lake in Intel Low Power Subsystem
- Fingerprint MCU in ChromeOS EC
- Touchpad MCU in ChromeOS EC
- Move TI LM3532 support to LED
New Functionality:
- max77650, max77620: Add/extend DT support
- max77620 power-off
- syscon clocking
- croc_ec host sleep event
Linus Torvalds [Tue, 14 May 2019 17:10:55 +0000 (10:10 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:
- a few misc things and hotfixes
- ocfs2
- almost all of MM
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (139 commits)
kernel/memremap.c: remove the unused device_private_entry_fault() export
mm: delete find_get_entries_tag
mm/huge_memory.c: make __thp_get_unmapped_area static
mm/mprotect.c: fix compilation warning because of unused 'mm' variable
mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
mm/vmscan: simplify trace_reclaim_flags and trace_shrink_flags
mm/Kconfig: update "Memory Model" help text
mm/vmscan.c: don't disable irq again when count pgrefill for memcg
mm: memblock: make keeping memblock memory opt-in rather than opt-out
hugetlbfs: always use address space in inode for resv_map pointer
mm/z3fold.c: support page migration
mm/z3fold.c: add structure for buddy handles
mm/z3fold.c: improve compression by extending search
mm/z3fold.c: introduce helper functions
mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist
mm/hmm: add ARCH_HAS_HMM_MIRROR ARCH_HAS_HMM_DEVICE Kconfig
mm/vmscan.c: simplify shrink_inactive_list()
fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback
xen/privcmd-buf.c: convert to use vm_map_pages_zero()
xen/gntdev.c: convert to use vm_map_pages()
...
kernel/memremap.c: remove the unused device_private_entry_fault() export
This export has been entirely unused since it was added more than 1 1/2
years ago.
Link: http://lkml.kernel.org/r/20190429115535.12793-1-hch@lst.de Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 14 May 2019 00:23:14 +0000 (17:23 -0700)]
mm/mprotect.c: fix compilation warning because of unused 'mm' variable
Since 7012160e4dcf ("mm: update ptep_modify_prot_start/commit to take
vm_area_struct as arg") the only place that uses the local 'mm' variable
in change_pte_range() is the call to set_pte_at().
Many architectures define set_pte_at() as macro that does not use the 'mm'
parameter, which generates the following compilation warning:
CC mm/mprotect.o
mm/mprotect.c: In function 'change_pte_range':
mm/mprotect.c:42:20: warning: unused variable 'mm' [-Wunused-variable]
struct mm_struct *mm = vma->vm_mm;
^~
Fix it by passing vma->mm to set_pte_at() and dropping the local 'mm'
variable in change_pte_range().
[liu.song.a23@gmail.com: fix missed conversions] Link: http://lkml.kernel.org/r/CAPhsuW6wcQgYLHNdBdw6m0YiR4RWsS4XzfpSKU7wBLLeOCTbpw@mail.gmail.comLink: Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Song Liu <liu.song.a23@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yafang Shao [Tue, 14 May 2019 00:23:11 +0000 (17:23 -0700)]
mm/page-writeback: introduce tracepoint for wait_on_page_writeback()
Recently there have been some hung tasks on our server due to
wait_on_page_writeback(), and we want to know the details of this
PG_writeback, i.e. this page is writing back to which device. But it is
not so convenient to get the details.
I think it would be better to introduce a tracepoint for diagnosing the
writeback details.
Link: http://lkml.kernel.org/r/1556274402-19018-1-git-send-email-laoar.shao@gmail.com Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Jan Kara <jack@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Tue, 14 May 2019 00:23:05 +0000 (17:23 -0700)]
mm/Kconfig: update "Memory Model" help text
The help describing the memory model selection is outdated. It still says
that SPARSEMEM is experimental and DISCONTIGMEM is a preferred over
SPARSEMEM.
Update the help text for the relevant options:
* add a generic help for the "Memory Model" prompt
* add description for FLATMEM
* reduce the description of DISCONTIGMEM and add a deprecation note
* prefer SPARSEMEM over DISCONTIGMEM
Mike Rapoport [Tue, 14 May 2019 00:22:59 +0000 (17:22 -0700)]
mm: memblock: make keeping memblock memory opt-in rather than opt-out
Most architectures do not need the memblock memory after the page
allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
arch Kconfig.
Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
logic makes it clear which architectures actually use memblock after
system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
to the architectures that are still missing that option.
Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Russell King <linux@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Richard Kuo <rkuo@codeaurora.org> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Paul Burton <paul.burton@mips.com> Cc: James Hogan <jhogan@kernel.org> Cc: Ley Foon Tan <lftan@altera.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Rich Felker <dalias@libc.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Eric Biederman <ebiederm@xmission.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Tue, 14 May 2019 00:22:55 +0000 (17:22 -0700)]
hugetlbfs: always use address space in inode for resv_map pointer
Continuing discussion about dc432629b3ea ("hugetlbfs: fix memory leak for
resv_map") brought up the issue that inode->i_mapping may not point to the
address space embedded within the inode at inode eviction time. The
hugetlbfs truncate routine handles this by explicitly using inode->i_data.
However, code cleaning up the resv_map will still use the address space
pointed to by inode->i_mapping. Luckily, private_data is NULL for address
spaces in all such cases today but, there is no guarantee this will
continue.
Change all hugetlbfs code getting a resv_map pointer to explicitly get it
from the address space embedded within the inode. In addition, add more
comments in the code to indicate why this is being done.
Link: http://lkml.kernel.org/r/20190419204435.16984-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Yufen Yu <yuyufen@huawei.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vitaly Wool [Tue, 14 May 2019 00:22:52 +0000 (17:22 -0700)]
mm/z3fold.c: support page migration
Now that we are not using page address in handles directly, we can make
z3fold pages movable to decrease the memory fragmentation z3fold may
create over time.
This patch starts advertising non-headless z3fold pages as movable and
uses the existing kernel infrastructure to implement moving of such pages
per memory management subsystem's request. It thus implements 3 required
callbacks for page migration:
* isolation callback: z3fold_page_isolate(): try to isolate the page by
removing it from all lists. Pages scheduled for some activity and
mapped pages will not be isolated. Return true if isolation was
successful or false otherwise
* migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
memory subsystem. Returns 0 on success or negative error code otherwise
* putback callback: z3fold_page_putback(): put back the page if
z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).
[lkp@intel.com: z3fold_page_isolate() can be static] Link: http://lkml.kernel.org/r/20190419130924.GA161478@ivb42 Link: http://lkml.kernel.org/r/20190417103922.31253da5c366c4ebe0419cfc@gmail.com Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Signed-off-by: kbuild test robot <lkp@intel.com> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Krzysztof Kozlowski <k.kozlowski@samsung.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vitaly Wool [Tue, 14 May 2019 00:22:49 +0000 (17:22 -0700)]
mm/z3fold.c: add structure for buddy handles
For z3fold to be able to move its pages per request of the memory
subsystem, it should not use direct object addresses in handles. Instead,
it will create abstract handles (3 per page) which will contain pointers
to z3fold objects. Thus, it will be possible to change these pointers
when z3fold page is moved.
Vitaly Wool [Tue, 14 May 2019 00:22:46 +0000 (17:22 -0700)]
mm/z3fold.c: improve compression by extending search
The current z3fold implementation only searches this CPU's page lists for
a fitting page to put a new object into. This patch adds quick search for
very well fitting pages (i. e. those having exactly the required number
of free space) on other CPUs too, before allocating a new page for that
object.
Vitaly Wool [Tue, 14 May 2019 00:22:43 +0000 (17:22 -0700)]
mm/z3fold.c: introduce helper functions
Patch series "z3fold: support page migration", v2.
This patchset implements page migration support and slightly better buddy
search. To implement page migration support, z3fold has to move away from
the current scheme of handle encoding. i. e. stop encoding page address
in handles. Instead, a small per-page structure is created which will
contain actual addresses for z3fold objects, while pointers to fields of
that structure will be used as handles.
Thus, it will be possible to change the underlying addresses to reflect
page migration.
To support migration itself, 3 callbacks will be implemented:
1: isolation callback: z3fold_page_isolate(): try to isolate the page
by removing it from all lists. Pages scheduled for some activity and
mapped pages will not be isolated. Return true if isolation was
successful or false otherwise
2: migration callback: z3fold_page_migrate(): re-check critical
conditions and migrate page contents to the new page provided by the
system. Returns 0 on success or negative error code otherwise
3: putback callback: z3fold_page_putback(): put back the page if
z3fold_page_migrate() for it failed permanently (i. e. not with
-EAGAIN code).
To make sure an isolated page doesn't get freed, its kref is incremented
in z3fold_page_isolate() and decremented during post-migration compaction,
if migration was successful, or by z3fold_page_putback() in the other
case.
Since the new handle encoding scheme implies slight memory consumption
increase, better buddy search (which decreases memory consumption) is
included in this patchset.
This patch (of 4):
Introduce a separate helper function for object allocation, as well as 2
smaller helpers to add a buddy to the list and to get a pointer to the
pool from the z3fold header. No functional changes here.