mm/vmstat.c: fix /proc/vmstat format for CONFIG_DEBUG_TLBFLUSH=y CONFIG_SMP=n
Commit 681068f4a6c0 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
depends on skipping vmstat entries with empty name introduced in 7877cc9afee7 ("mm: don't show nr_indirectly_reclaimable in
/proc/vmstat") but reverted in f59f2678f01a ("mm: rename and change
semantics of nr_indirectly_reclaimable_bytes").
So skipping no longer works and /proc/vmstat has misformatted lines " 0".
This patch simply shows debug counters "nr_tlb_remote_*" for UP.
mm/memory_hotplug: do not unlock after failing to take the device_hotplug_lock
When adding memory by probing a memory block in the sysfs interface,
there is an obvious issue where we will unlock the device_hotplug_lock
when we failed to takes it.
That issue was introduced in 6c615b6bae48 ("mm/memory_hotplug: make
add_memory() take the device_hotplug_lock").
We should drop out in time when failing to take the device_hotplug_lock.
Link: http://lkml.kernel.org/r/1554696437-9593-1-git-send-email-zhongjiang@huawei.com Fixes: 6c615b6bae48 ("mm/memory_hotplug: make add_memory() take the device_hotplug_lock") Signed-off-by: zhong jiang <zhongjiang@huawei.com> Reported-by: Yang yingliang <yangyingliang@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: swapoff: shmem_unuse() stop eviction without igrab()
The igrab() in shmem_unuse() looks good, but we forgot that it gives no
protection against concurrent unmounting: a point made by Konstantin
Khlebnikov eight years ago, and then fixed in 2.6.39 by 3ef34e05ad47
("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
Self-destruct in 5 seconds. Have a nice day..." followed by GPF.
Once again, give up on using igrab(); but don't go back to making such
heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
the new design, and I expect could deadlock inside shmem_swapin_page().
Instead, shmem_unuse() just raise a "stop_eviction" count in the shmem-
specific inode, and shmem_evict_inode() wait for that to go down to 0.
Call it "stop_eviction" rather than "swapoff_busy" because it can be put
to use for others later (huge tmpfs patches expect to use it).
That simplifies shmem_unuse(), protecting it from both unlink and
unmount; and in practice lets it locate all the swap in its first try.
But do not rely on that: there's still a theoretical case, when
shmem_writepage() might have been preempted after its get_swap_page(),
before making the swap entry visible to swapoff.
mm: swapoff: remove too limiting SWAP_UNUSE_MAX_TRIES
SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
further testing has proved it to be a source of unnecessary swapoff
EBUSY failures (which can then be followed by unmount EBUSY failures).
When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
or inode being evicted, freeing up swap independent of try_to_unuse().
Those typically completed much sooner than the old quadratic swapoff,
but now it's more common that swapoff may need to wait for them.
It's possible to move those cases from init_mm.mmlist and shmem_swaplist
to separate "exiting" swaplists, and try_to_unuse() then wait for those
lists to be emptied; but we've not bothered with that in the past, and
don't want to risk missing some other forgotten case. So just revert to
cycling around until the swap is gone, without any retries limit.
mm: swapoff: shmem_find_swap_entries() filter out other types
Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
then forgotten from shmem_find_swap_entries(): with the result that
removing one swapfile would try to free up all the swap from shmem - no
problem when only one swapfile anyway, but counter-productive when more,
causing swapoff to be unnecessarily OOM-killed when it should succeed.
Qian Cai [Fri, 19 Apr 2019 00:49:55 +0000 (17:49 -0700)]
slab: store tagged freelist for off-slab slabmgmt
Commit c30cac8dff00 ("kasan, slab: make freelist stored without tags")
calls kasan_reset_tag() for off-slab slab management object leading to
freelist being stored non-tagged.
However, cache_grow_begin() calls alloc_slabmgmt() which calls
kmem_cache_alloc_node() assigns a tag for the address and stores it in
the shadow address. As the result, it causes endless errors below
during boot due to drain_freelist() -> slab_destroy() ->
kasan_slab_free() which compares already untagged freelist against the
stored tag in the shadow address.
Since off-slab slab management object freelist is such a special case,
just store it tagged. Non-off-slab management object freelist is still
stored untagged which has not been assigned a tag and should not cause
any other troubles with this inconsistency.
BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
Pointer tag: [ff], memory tag: [99]
The buggy address belongs to the object at ffff809681b89e00
which belongs to the cache kmalloc-128 of size 128
The buggy address is located 0 bytes inside of
128-byte region [ffff809681b89e00, ffff809681b89e80)
The buggy address belongs to the page:
page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
index:0xffff809681b8fe04
flags: 0x17ffffffc000200(slab)
raw: 017ffffffc000200ffff7fe025a06d08ffff7fe022ef7b8801ff80082000fb00
raw: ffff809681b8fe04ffff809681b8000000000001000000e00000000000000000
page dumped because: kasan: bad access detected
page allocated via order 0, migratetype Unmovable, gfp_mask
0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
prep_new_page+0x4e0/0x5e0
get_page_from_freelist+0x4ce8/0x50d4
__alloc_pages_nodemask+0x738/0x38b8
cache_grow_begin+0xd8/0xa24
____cache_alloc_node+0x14c/0x268
__kmalloc+0x1c8/0x3fc
ftrace_free_mem+0x408/0x1284
ftrace_free_init_mem+0x20/0x28
kernel_init+0x24/0x548
ret_from_fork+0x10/0x18
Memory state around the buggy address: ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
>ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
^ ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw Fixes: c30cac8dff00 ("kasan, slab: make freelist stored without tags") Signed-off-by: Qian Cai <cai@lca.pw> Reviewed-by: Andrey Konovalov <andreyknvl@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fix from Catalin Marinas:
"Avoid compiler uninitialised warning introduced by recent arm64 futex
fix"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: futex: Restore oldval initialization to work around buggy compilers
arm64: futex: Restore oldval initialization to work around buggy compilers
Commit 2c6482567488 ("arm64: futex: Fix FUTEX_WAKE_OP atomic ops with
non-zero result value") removed oldval's zero initialization in
arch_futex_atomic_op_inuser because it is not necessary. Unfortunately,
Android's arm64 GCC 4.9.4 [1] does not agree:
../kernel/futex.c: In function 'do_futex':
../kernel/futex.c:1658:17: warning: 'oldval' may be used uninitialized
in this function [-Wmaybe-uninitialized]
return oldval == cmparg;
^
In file included from ../kernel/futex.c:73:0:
../arch/arm64/include/asm/futex.h:53:6: note: 'oldval' was declared here
int oldval, ret, tmp;
^
GCC fails to follow that when ret is non-zero, futex_atomic_op_inuser
returns right away, avoiding the uninitialized use that it claims.
Restoring the zero initialization works around this issue.
As stated in the original commit for pidfd_send_signal() we don't allow
to signal processes through O_PATH file descriptors since it is
semantically equivalent to a write on the pidfd.
We already correctly error out right now and return EBADF if an O_PATH
fd is passed. This is because we use file->f_op to detect whether a
pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
in do_dentry_open() and thus fail the test.
Thus, there is no regression. It's just semantically correct to use
fdget() and return an error right from there instead of taking a
reference and returning an error later.
Signed-off-by: Christian Brauner <christian@brauner.io> Acked-by: Oleg Nesterov <oleg@redhat.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Kees Cook <keescook@chromium.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Jann Horn <jann@thejh.net> Cc: David Howells <dhowells@redhat.com> Cc: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com> Cc: Andy Lutomirsky <luto@kernel.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 's390-5.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 bug fixes from Martin Schwidefsky:
- Fix overwrite of the initial ramdisk due to misuse of IS_ENABLED
- Fix integer overflow in the dasd driver resulting in incorrect number
of blocks for large devices
- Fix a lockdep false positive in the 3270 driver
- Fix a deadlock in the zcrypt driver
- Fix incorrect debug feature entries in the pkey api
- Fix inline assembly constraints fallout with CONFIG_KASAN=y
* tag 's390-5.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390: correct some inline assembly constraints
s390/pkey: add one more argument space for debug feature entry
s390/zcrypt: fix possible deadlock situation on ap queue remove
s390/3270: fix lockdep false positive on view->lock
s390/dasd: Fix capacity calculation for large volumes
s390/mem_detect: Use IS_ENABLED(CONFIG_BLK_DEV_INITRD)
Merge tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull AFS fixes from David Howells:
- Stop using the deprecated get_seconds().
- Don't make tracepoint strings const as the section they go in isn't
read-only.
- Differentiate failure due to unmarshalling from other failure cases.
We shouldn't abort with RXGEN_CC/SS_UNMARSHAL if it's not due to
unmarshalling.
- Add a missing unlock_page().
- Fix the interaction between receiving a notification from a server
that it has invalidated all outstanding callback promises and a
client call that we're in the middle of making that will get a new
promise.
* tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
afs: Fix in-progess ops to ignore server-level callback invalidation
afs: Unlock pages for __pagevec_release()
afs: Differentiate abort due to unmarshalling from other errors
afs: Avoid section confusion in CM_NAME
afs: avoid deprecated get_seconds()
Merge tag 'drm-fixes-2019-04-18' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Since Easter is looming for me, I'm just pushing whatever is in my
tree, I'll see what else turns up and maybe I'll send another pull
early next week if there is anything.
tegra:
- stream id programming fix
- avoid divide by 0 for bad hdmi audio setup code
amdgpu:
- GPU VM fixes for Vega/RV
- DC AUX fix for active DP-DVI dongles
- DC fix for multihead regression"
* tag 'drm-fixes-2019-04-18' of git://anongit.freedesktop.org/drm/drm:
drm/tegra: hdmi: Setup audio only if configured
drm/amd/display: If one stream full updates, full update all planes
drm/amdgpu/gmc9: fix VM_L2_CNTL3 programming
drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR
gpu: host1x: Program stream ID to bypass without SMMU
drm/amd/display: extending AUX SW Timeout
drm/ttm: fix dma_fence refcount imbalance on error path
drm/ttm: fix incrementing the page pointer for huge pages
drm/ttm: fix start page for huge page check in ttm_put_pages()
drm/ttm: fix out-of-bounds read in ttm_put_pages() v2
Dave Airlie [Wed, 17 Apr 2019 20:56:26 +0000 (06:56 +1000)]
Merge branch 'drm-fixes-5.1' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
- GPUVM fixes for vega/RV and shadow buffers
- TTM fixes for hugepages
- TTM fix for refcount imbalance in error path
- DC AUX fix for some active DP-DVI dongles
- DC fix for multihead VT switch regression
Dave Airlie [Wed, 17 Apr 2019 20:55:34 +0000 (06:55 +1000)]
Merge tag 'drm/tegra/for-5.1-rc6' of git://anongit.freedesktop.org/tegra/linux into drm-fixes
drm/tegra: Fixes for v5.1-rc6
This contains a follow-up fix for the stream ID programming and a fix
for a regression on older Tegra devices (Tegra20 and Tegra30) that are
running into a division by zero trying to enable audio over HDMI.
Merge tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6
Pull smb3 fixes from Steve French:
"Five small SMB3 fixes, all also for stable - an important fix for an
oplock (lease) bug, a handle leak, and three bugs spotted by KASAN"
* tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
CIFS: keep FileInfo handle live during oplock break
cifs: fix handle leak in smb2_query_symlink()
cifs: Fix lease buffer length error
cifs: Fix use-after-free in SMB2_read
cifs: Fix use-after-free in SMB2_write
Merge tag 'for-linus-5.1-2' of git://github.com/cminyard/linux-ipmi
Pull IPMI fixes from Corey Minyard:
"Fixes for some bugs cause by recent changes. One crash if you feed bad
data to the module parameters, one BUG that sometimes occurs when a
user closes the connection, and one bug that cause the driver to not
work if the configuration information only comes in from SMBIOS"
* tag 'for-linus-5.1-2' of git://github.com/cminyard/linux-ipmi:
ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier
ipmi: ipmi_si_hardcode.c: init si_type array to fix a crash
ipmi: Fix failure on SMBIOS specified devices
Inline assembly code changed in this patch should really use "Q"
constraint "Memory reference without index register and with short
displacement". The kernel build with kasan instrumentation enabled
might occasionally break otherwise (due to stack instrumentation).
Signed-off-by: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
The audio configuration is only valid if the HDMI codec has been
properly set up. Do not attempt to set up audio before that happens
because it causes a division by zero.
Note that this is only problematic on Tegra20 and Tegra30. Later chips
implement the division instructions which return zero when dividing by
zero and don't throw an exception.
Fixes: 12037a0f6ac0 ("drm/tegra: hdmi: Fix audio to work with any pixel clock rate") Reported-by: Marcel Ziswiler <marcel.ziswiler@toradex.com> Tested-by: Dmitry Osipenko <digetx@gmail.com> Signed-off-by: Thierry Reding <treding@nvidia.com>
It looks like the new socket options only work correctly
for native execution, but in case of compat mode fall back
to the old behavior as we ignore the 'old_timeval' flag.
Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the
same way in compat and native 32-bit mode.
Cc: Deepa Dinamani <deepa.kernel@gmail.com> Fixes: ca8198f4f80e ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW") Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Tue, 16 Apr 2019 17:55:20 +0000 (10:55 -0700)]
tcp: tcp_grow_window() needs to respect tcp_space()
For some reason, tcp_grow_window() correctly tests if enough room
is present before attempting to increase tp->rcv_ssthresh,
but does not prevent it to grow past tcp_space()
This is causing hard to debug issues, like failing
the (__tcp_select_window(sk) >= tp->rcv_wnd) test
in __tcp_ack_snd_check(), causing ACK delays and possibly
slow flows.
Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
after about 60 round trips, when the active side no longer sends
immediate acks.
This bug predates git history.
Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
This is preventive cleanup that may save troubles later.
No need to cancel repeateadly queued work if code is properly
refactored.
Don't let the ethtool -s process interfere with the stat workqueue
scheduling.
Signed-off-by: Claudiu Manoil <claudiu.manoil@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
net: bridge: fix netlink export of vlan_stats_per_port option
Since the introduction of the vlan_stats_per_port option the netlink
export of it has been broken since I made a typo and used the ifla
attribute instead of the bridge option to retrieve its state.
Sysfs export is fine, only netlink export has been affected.
Fixes: a0f3c17c04dd3 ("net: bridge: add support for per-port vlan stats") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Colin Ian King [Tue, 16 Apr 2019 11:43:17 +0000 (12:43 +0100)]
qed: fix spelling mistake "faspath" -> "fastpath"
There is a spelling mistake in a DP_INFO message, fix it.
Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
Jie Liu [Tue, 16 Apr 2019 05:10:09 +0000 (13:10 +0800)]
tipc: set sysctl_tipc_rmem and named_timeout right range
We find that sysctl_tipc_rmem and named_timeout do not have the right minimum
setting. sysctl_tipc_rmem should be larger than zero, like sysctl_tcp_rmem.
And named_timeout as a timeout setting should be not less than zero.
Fixes: 4f71b04283e1a ("tipc: change socket buffer overflow control to respect sk_rcvbuf") Fixes: d607469a7f8bc ("tipc: add name distributor resiliency queue") Signed-off-by: Jie Liu <liujie165@huawei.com> Reported-by: Qiang Ning <ningqiang1@huawei.com> Reviewed-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
According to the link FSM, when a link endpoint got RESET_MSG (- a
traditional one without the stopping bit) from its peer, it moves to
PEER_RESET state and raises a LINK_DOWN event which then resets the
link itself. Its state will become ESTABLISHING after the reset event
and the link will be re-established soon after this endpoint starts to
send ACTIVATE_MSG to the peer.
There is no problem with this mechanism, however the link resetting has
cleared the link 'in_session' flag (along with the other important link
data such as: the link 'mtu') that was correctly set up at the 1st step
(i.e. when this endpoint received the peer RESET_MSG). As a result, the
link will become ESTABLISHED, but the 'in_session' flag is not set, and
all STATE_MSG from its peer will be dropped at the link_validate_msg().
It means the link not synced and will sooner or later face a failure.
Since the link reset action is obviously needed for a new link session
(this is also true in the other situations), the problem here is that
the link is re-established a bit too early when the link endpoints are
not really in-sync yet. The commit forces a resync as already done in
the previous commit 06836b36c226 ("tipc: fix link session and
re-establish issues") by simply varying the link 'peer_session' value
at the link_reset().
Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
net: Fix missing meta data in skb with vlan packet
skb_reorder_vlan_header() should move XDP meta data with ethernet header
if XDP meta data exists.
Fixes: 344eaa193a76 ("bpf: add meta pointer for direct access") Signed-off-by: Yuya Kusakabe <yuya.kusakabe@gmail.com> Signed-off-by: Takeru Hayasaka <taketarou2@gmail.com> Co-developed-by: Takeru Hayasaka <taketarou2@gmail.com> Reviewed-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix this by sanitizing arg before using it to index dev_lec.
Notice that given that speculation windows are large, the policy is
to kill the speculation on the first load and not worry if it can be
completed with a dependent load/store [1].
net/core: work around section mismatch warning for ptp_classifier
The routine ptp_classifier_init() uses an initializer for an
automatic struct type variable which refers to an __initdata
symbol. This is perfectly legal, but may trigger a section
mismatch warning when running the compiler in -fpic mode, due
to the fact that the initializer may be emitted into an anonymous
.data section thats lack the __init annotation. So work around it
by using assignments instead.
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When the commit below was introduced it changed two visible things:
- the skb was no longer passed through the protocol handlers with the
original device
- the skb was passed up the stack with skb->dev = bridge
The first change broke af_packet sockets on bridge ports. For example we
use them for hostapd which listens for ETH_P_PAE packets on the ports.
We discussed two possible fixes:
- create a clone and pass it through NF_HOOK(), act on the original skb
based on the result
- somehow signal to the caller from the okfn() that it was called,
meaning the skb is ok to be passed, which this patch is trying to
implement via returning 1 from the bridge link-local okfn()
Note that we rely on the fact that NF_QUEUE/STOLEN would return 0 and
drop/error would return < 0 thus the okfn() is called only when the
return was 1, so we signal to the caller that it was called by preserving
the return value from nf_hook().
Fixes: 5f7bf977db3c ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Tony Camuso [Tue, 9 Apr 2019 19:20:03 +0000 (15:20 -0400)]
ipmi: ipmi_si_hardcode.c: init si_type array to fix a crash
The intended behavior of function ipmi_hardcode_init_one() is to default
to kcs interface when no type argument is presented when initializing
ipmi with hard coded addresses.
However, the array of char pointers allocated on the stack by function
ipmi_hardcode_init() was not inited to zeroes, so it contained stack
debris.
Consequently, passing the cruft stored in this array to function
ipmi_hardcode_init_one() caused a crash when it was unable to detect
that the char * being passed was nonsense and tried to access the
address specified by the bogus pointer.
The fix is simply to initialize the si_type array to zeroes, so if
there were no type argument given to at the command line, function
ipmi_hardcode_init_one() could properly default to the kcs interface.
Signed-off-by: Tony Camuso <tcamuso@redhat.com>
Message-Id: <1554837603-40299-1-git-send-email-tcamuso@redhat.com> Signed-off-by: Corey Minyard <cminyard@mvista.com>
Merge tag 'riscv-for-linus-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
Pull RISC-V fixes from Palmer Dabbelt:
"This contains an assortment of RISC-V-related fixups that we found
after rc4. They're all really unrelated:
- The addition of a 32-bit defconfig, to emphasize testing the 32-bit
port.
- A device tree bindings patch, which is pre-work for some patches
that target 5.2.
- A fix to support booting on systems with more physical memory than
the maximum supported by the kernel"
* tag 'riscv-for-linus-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux:
RISC-V: Fix Maximum Physical Memory 2GiB option for 64bit systems
dt-bindings: clock: sifive: add FU540-C000 PRCI clock constants
RISC-V: Add separate defconfig for 32bit systems
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM fixes from Paolo Bonzini:
"5.1 keeps its reputation as a big bugfix release for KVM x86.
- Fix for a memory leak introduced during the merge window
- Fixes for nested VMX with ept=0
- Fixes for AMD (APIC virtualization, NMI injection)
- Fixes for Hyper-V under KVM and KVM under Hyper-V
- Fixes for 32-bit SMM and tests for SMM virtualization
- More array_index_nospec peppering"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (21 commits)
KVM: x86: avoid misreporting level-triggered irqs as edge-triggered in tracing
KVM: fix spectrev1 gadgets
KVM: x86: fix warning Using plain integer as NULL pointer
selftests: kvm: add a selftest for SMM
selftests: kvm: fix for compilers that do not support -no-pie
selftests: kvm/evmcs_test: complete I/O before migrating guest state
KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels
KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU
KVM: x86: clear SMM flags before loading state while leaving SMM
KVM: x86: Open code kvm_set_hflags
KVM: x86: Load SMRAM in a single shot when leaving SMM
KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU
KVM: x86: Raise #GP when guest vCPU do not support PMU
x86/kvm: move kvm_load/put_guest_xcr0 into atomic context
KVM: x86: svm: make sure NMI is injected after nmi_singlestep
svm/avic: Fix invalidate logical APIC id entry
Revert "svm: Fix AVIC incomplete IPI emulation"
kvm: mmu: Fix overflow on kvm mmu page limit calculation
KVM: nVMX: always use early vmcs check when EPT is disabled
KVM: nVMX: allow tests to use bad virtual-APIC page address
...
Aurelien Aptel [Fri, 29 Mar 2019 09:49:12 +0000 (10:49 +0100)]
CIFS: keep FileInfo handle live during oplock break
In the oplock break handler, writing pending changes from pages puts
the FileInfo handle. If the refcount reaches zero it closes the handle
and waits for any oplock break handler to return, thus causing a deadlock.
To prevent this situation:
* We add a wait flag to cifsFileInfo_put() to decide whether we should
wait for running/pending oplock break handlers
* We keep an additionnal reference of the SMB FileInfo handle so that
for the rest of the handler putting the handle won't close it.
- The ref is bumped everytime we queue the handler via the
cifs_queue_oplock_break() helper.
- The ref is decremented at the end of the handler
This bug was triggered by xfstest 464.
Also important fix to address the various reports of
oops in smb2_push_mandatory_locks
Signed-off-by: Aurelien Aptel <aaptel@suse.com> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> CC: Stable <stable@vger.kernel.org>
Ronnie Sahlberg [Tue, 9 Apr 2019 21:47:22 +0000 (07:47 +1000)]
cifs: fix handle leak in smb2_query_symlink()
If we enter smb2_query_symlink() for something that is not a symlink
and where the SMB2_open() would succeed we would never end up
closing this handle and would thus leak a handle on the server.
Fix this by immediately calling SMB2_close() on successfull open.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> CC: Stable <stable@vger.kernel.org> Signed-off-by: Steve French <stfrench@microsoft.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
There is a KASAN slab-out-of-bounds:
BUG: KASAN: slab-out-of-bounds in _copy_from_iter_full+0x783/0xaa0
Read of size 80 at addr ffff88810c35e180 by task mount.cifs/539
It can be reproduced by the following step:
1. samba configured with: server max protocol = SMB2_10
2. mount -o vers=default
When parse the mount version parameter, the 'ops' and 'vals'
was setted to smb30, if negotiate result is smb21, just
update the 'ops' to smb21, but the 'vals' is still smb30.
When add lease context, the iov_base is allocated with smb21
ops, but the iov_len is initiallited with the smb30. Because
the iov_len is longer than iov_base, when send the message,
copy array out of bounds.
we need to keep the 'ops' and 'vals' consistent.
Fixes: 26b02edb468e ("SMB3: Add support for multidialect negotiate (SMB2.1 and later)") Fixes: ddde1a77a34e ("smb3: add smb3.1.1 to default dialect list") Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Steve French <stfrench@microsoft.com> CC: Stable <stable@vger.kernel.org> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Vitaly Kuznetsov [Wed, 27 Mar 2019 14:12:20 +0000 (15:12 +0100)]
KVM: x86: avoid misreporting level-triggered irqs as edge-triggered in tracing
In __apic_accept_irq() interface trig_mode is int and actually on some code
paths it is set above u8:
kvm_apic_set_irq() extracts it from 'struct kvm_lapic_irq' where trig_mode
is u16. This is done on purpose as e.g. kvm_set_msi_irq() sets it to
(1 << 15) & e->msi.data
kvm_apic_local_deliver sets it to reg & (1 << 15).
Fix the immediate issue by making 'tm' into u16. We may also want to adjust
__apic_accept_irq() interface and use proper sizes for vector, level,
trig_mode but this is not urgent.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Add a simple test for SMM, based on VMX. The test implements its own
sync between the guest and the host as using our ucall library seems to
be too cumbersome: SMI handler is happening in real-address mode.
This patch also fixes KVM_SET_NESTED_STATE to happen after
KVM_SET_VCPU_EVENTS, in fact it places it last. This is because
KVM needs to know whether the processor is in SMM or not.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Apr 2019 13:51:19 +0000 (15:51 +0200)]
selftests: kvm: fix for compilers that do not support -no-pie
-no-pie was added to GCC at the same time as their configuration option
--enable-default-pie. Compilers that were built before do not have
-no-pie, but they also do not need it. Detect the option at build
time.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Thu, 11 Apr 2019 13:57:14 +0000 (15:57 +0200)]
selftests: kvm/evmcs_test: complete I/O before migrating guest state
Starting state migration after an IO exit without first completing IO
may result in test failures. We already have two tests that need this
(this patch in fact fixes evmcs_test, similar to what was fixed for
state_test in commit 7b8c2f1bac8d, "KVM: selftests: complete IO before
migrating guest state", 2019-03-13) and a third is coming. So, move the
code to vcpu_save_state, and while at it do not access register state
until after I/O is complete.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Always use 32-bit SMRAM save state for 32-bit kernels
Invoking the 64-bit variation on a 32-bit kenrel will crash the guest,
trigger a WARN, and/or lead to a buffer overrun in the host, e.g.
rsm_load_state_64() writes r8-r15 unconditionally, but enum kvm_reg and
thus x86_emulate_ctxt._regs only define r8-r15 for CONFIG_X86_64.
KVM allows userspace to report long mode support via CPUID, even though
the guest is all but guaranteed to crash if it actually tries to enable
long mode. But, a pure 32-bit guest that is ignorant of long mode will
happily plod along.
SMM complicates things as 64-bit CPUs use a different SMRAM save state
area. KVM handles this correctly for 64-bit kernels, e.g. uses the
legacy save state map if userspace has hid long mode from the guest,
but doesn't fare well when userspace reports long mode support on a
32-bit host kernel (32-bit KVM doesn't support 64-bit guests).
Since the alternative is to crash the guest, e.g. by not loading state
or explicitly requesting shutdown, unconditionally use the legacy SMRAM
save state map for 32-bit KVM. If a guest has managed to get far enough
to handle SMIs when running under a weird/buggy userspace hypervisor,
then don't deliberately crash the guest since there are no downsides
(from KVM's perspective) to allow it to continue running.
Fixes: ff72cf22cee1f ("KVM: x86: save/load state on SMM switch") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Don't clear EFER during SMM transitions for 32-bit vCPU
Neither AMD nor Intel CPUs have an EFER field in the legacy SMRAM save
state area, i.e. don't save/restore EFER across SMM transitions. KVM
somewhat models this, e.g. doesn't clear EFER on entry to SMM if the
guest doesn't support long mode. But during RSM, KVM unconditionally
clears EFER so that it can get back to pure 32-bit mode in order to
start loading CRs with their actual non-SMM values.
Clear EFER only when it will be written when loading the non-SMM state
so as to preserve bits that can theoretically be set on 32-bit vCPUs,
e.g. KVM always emulates EFER_SCE.
And because CR4.PAE is cleared only to play nice with EFER, wrap that
code in the long mode check as well. Note, this may result in a
compiler warning about cr4 being consumed uninitialized. Re-read CR4
even though it's technically unnecessary, as doing so allows for more
readable code and RSM emulation is not a performance critical path.
Fixes: ff72cf22cee1f ("KVM: x86: save/load state on SMM switch") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: clear SMM flags before loading state while leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1. Stop dancing around the issue of HF_SMM_MASK being set when
loading SMSTATE into architectural state, e.g. by toggling it for
problematic flows, and simply clear HF_SMM_MASK prior to loading
architectural state (from SMRAM save state area).
Reported-by: Jon Doron <arilou@gmail.com> Cc: Jim Mattson <jmattson@google.com> Cc: Liran Alon <liran.alon@oracle.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Fixes: b274b14a32a0 ("KVM: VMX: check nested state and CR4.VMXE against SMM") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Prepare for clearing HF_SMM_MASK prior to loading state from the SMRAM
save state map, i.e. kvm_smm_changed() needs to be called after state
has been loaded and so cannot be done automatically when setting
hflags from RSM.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: Load SMRAM in a single shot when leaving SMM
RSM emulation is currently broken on VMX when the interrupted guest has
CR4.VMXE=1. Rather than dance around the issue of HF_SMM_MASK being set
when loading SMSTATE into architectural state, ideally RSM emulation
itself would be reworked to clear HF_SMM_MASK prior to loading non-SMM
architectural state.
Ostensibly, the only motivation for having HF_SMM_MASK set throughout
the loading of state from the SMRAM save state area is so that the
memory accesses from GET_SMSTATE() are tagged with role.smm. Load
all of the SMRAM save state area from guest memory at the beginning of
RSM emulation, and load state from the buffer instead of reading guest
memory one-by-one.
This paves the way for clearing HF_SMM_MASK prior to loading state,
and also aligns RSM with the enter_smm() behavior, which fills a
buffer and writes SMRAM save state in a single go.
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liran Alon [Mon, 25 Mar 2019 19:09:17 +0000 (21:09 +0200)]
KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU
Issue was discovered when running kvm-unit-tests on KVM running as L1 on
top of Hyper-V.
When vmx_instruction_intercept unit-test attempts to run RDPMC to test
RDPMC-exiting, it is intercepted by L1 KVM which it's EXIT_REASON_RDPMC
handler raise #GP because vCPU exposed by Hyper-V doesn't support PMU.
Instead of unit-test expectation to be reflected with EXIT_REASON_RDPMC.
The reason vmx_instruction_intercept unit-test attempts to run RDPMC
even though Hyper-V doesn't support PMU is because L1 expose to L2
support for RDPMC-exiting. Which is reasonable to assume that is
supported only in case CPU supports PMU to being with.
Above issue can easily be simulated by modifying
vmx_instruction_intercept config in x86/unittests.cfg to run QEMU with
"-cpu host,+vmx,-pmu" and run unit-test.
To handle issue, change KVM to expose RDPMC-exiting only when guest
supports PMU.
Reported-by: Saar Amar <saaramar@microsoft.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Liran Alon [Mon, 25 Mar 2019 19:10:17 +0000 (21:10 +0200)]
KVM: x86: Raise #GP when guest vCPU do not support PMU
Before this change, reading a VMware pseduo PMC will succeed even when
PMU is not supported by guest. This can easily be seen by running
kvm-unit-test vmware_backdoors with "-cpu host,-pmu" option.
Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Signed-off-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
out, host cpu has guest xcr0 loaded (0xff).
In __switch_to {
switch_fpu_finish
copy_kernel_to_fpregs
XRSTORS
If any bit i in XSTATE_BV[i] == 1 and xcr0[i] == 0, XRSTORS will
generate #GP (In this case, bit 9). Then ex_handler_fprestore kicks in
and tries to reinitialize fpu by restoring init fpu state. Same story as
last #GP, except we get DOUBLE FAULT this time.
Cc: stable@vger.kernel.org Signed-off-by: WANG Chao <chao.wang@ucloud.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
KVM: x86: svm: make sure NMI is injected after nmi_singlestep
I noticed that apic test from kvm-unit-tests always hangs on my EPYC 7401P,
the hanging test nmi-after-sti is trying to deliver 30000 NMIs and tracing
shows that we're sometimes able to deliver a few but never all.
When we're trying to inject an NMI we may fail to do so immediately for
various reasons, however, we still need to inject it so enable_nmi_window()
arms nmi_singlestep mode. #DB occurs as expected, but we're not checking
for pending NMIs before entering the guest and unless there's a different
event to process, the NMI will never get delivered.
Make KVM_REQ_EVENT request on the vCPU from db_interception() to make sure
pending NMIs are checked and possibly injected.
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Only clear the valid bit when invalidate logical APIC id entry.
The current logic clear the valid bit, but also set the rest of
the bits (including reserved bits) to 1.
Fixes: 5d466ba72d15 ('svm: Fix AVIC DFR and LDR handling') Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
that the change coud potentially cause an extra IPI to be sent to
the destination vcpu because the AVIC hardware already set the IRR bit
before the incomplete IPI #VMEXIT with id=1 (target vcpu is not running).
Since writting to ICR and ICR2 will also set the IRR. If something triggers
the destination vcpu to get scheduled before the emulation finishes, then
this could result in an additional IPI.
Also, the issue mentioned in the commit c6f94dfcd991 was misdiagnosed.
Ben Gardon [Mon, 8 Apr 2019 18:07:30 +0000 (11:07 -0700)]
kvm: mmu: Fix overflow on kvm mmu page limit calculation
KVM bases its memory usage limits on the total number of guest pages
across all memslots. However, those limits, and the calculations to
produce them, use 32 bit unsigned integers. This can result in overflow
if a VM has more guest pages that can be represented by a u32. As a
result of this overflow, KVM can use a low limit on the number of MMU
pages it will allocate. This makes KVM unable to map all of guest memory
at once, prompting spurious faults.
Tested: Ran all kvm-unit-tests on an Intel Haswell machine. This patch
introduced no new failures.
Signed-off-by: Ben Gardon <bgardon@google.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 15 Apr 2019 13:57:19 +0000 (15:57 +0200)]
KVM: nVMX: always use early vmcs check when EPT is disabled
The remaining failures of vmx.flat when EPT is disabled are caused by
incorrectly reflecting VMfails to the L1 hypervisor. What happens is
that nested_vmx_restore_host_state corrupts the guest CR3, reloading it
with the host's shadow CR3 instead, because it blindly loads GUEST_CR3
from the vmcs01.
For simplicity let's just always use hardware VMCS checks when EPT is
disabled. This way, nested_vmx_restore_host_state is not reached at
all (or at least shouldn't be reached).
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Paolo Bonzini [Mon, 15 Apr 2019 13:16:17 +0000 (15:16 +0200)]
KVM: nVMX: allow tests to use bad virtual-APIC page address
As mentioned in the comment, there are some special cases where we can simply
clear the TPR shadow bit from the CPU-based execution controls in the vmcs02.
Handle them so that we can remove some XFAILs from vmx.flat.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Merge tag 'libnvdimm-fixes-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull libnvdimm fixes from Dan Williams:
"I debated holding this back for the v5.2 merge window due to the size
of the "zero-key" changes, but affected users would benefit from
having the fixes sooner. It did not make sense to change the zero-key
semantic in isolation for the "secure-erase" command, but instead
include it for all security commands.
The short background on the need for these changes is that some NVDIMM
platforms enable security with a default zero-key rather than let the
OS specify the initial key. This makes the security enabling that
landed in v5.0 unusable for some users.
Summary:
- Compatibility fix for nvdimm-security implementations with a
default zero-key.
- Miscellaneous small fixes for out-of-bound accesses, cleanup after
initialization failures, and missing debug messages"
* tag 'libnvdimm-fixes-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
tools/testing/nvdimm: Retain security state after overwrite
libnvdimm/pmem: fix a possible OOB access when read and write pmem
libnvdimm/security, acpi/nfit: unify zero-key for all security commands
libnvdimm/security: provide fix for secure-erase to use zero-key
libnvdimm/btt: Fix a kmemdup failure check
libnvdimm/namespace: Fix a potential NULL pointer dereference
acpi/nfit: Always dump _DSM output payload
Merge tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull fsdax fix from Dan Williams:
"A single filesystem-dax fix. It has been lingering in -next for a long
while and there are no other fsdax fixes on the horizon:
- Avoid a crash scenario with architectures like powerpc that require
'pgtable_deposit' for the zero page"
* tag 'fsdax-fix-5.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
fs/dax: Deposit pagetable even when installing zero page
Jonathan Lemon [Sun, 14 Apr 2019 21:21:29 +0000 (14:21 -0700)]
route: Avoid crash from dereferencing NULL rt->from
When __ip6_rt_update_pmtu() is called, rt->from is RCU dereferenced, but is
never checked for null - rt6_flush_exceptions() may have removed the entry.
Fixes: 17ca15645c70 ("net/ipv6: Make from in rt6_info rcu protected") Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@gmail.com> Reviewed-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
MAINTAINERS contains a lower-case and upper-case variant of
Woojung Huh' s email address.
Only keep the lower-case variant in MAINTAINERS.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Acked-by: Woojung Huh <woojung.huh@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When a bond is enslaved to another bond, bond_netdev_event() only
handles the event as if the bond is a master, and skips treating the
bond as a slave.
This leads to a refcount leak on the slave, since we don't remove the
adjacency to its master and the master holds a reference on the slave.
Reproducer:
ip link add bondL type bond
ip link add bondU type bond
ip link set bondL master bondU
ip link del bondL
No "Fixes:" tag, this code is older than git history.
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net>
The reverted bugfix will cause another issue.
Reported by syzbot+6024817a931b2830bc93@syzkaller.appspotmail.com.
See https://syzkaller.appspot.com/x/log.txt?x=1737671b200000 for
details.
Signed-off-by: Wang Hai <wanghai26@huawei.com> Acked-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
KVM: x86/mmu: Fix an inverted list_empty() check when zapping sptes
A recently introduced helper for handling zap vs. remote flush
incorrectly bails early, effectively leaking defunct shadow pages.
Manifests as a slab BUG when exiting KVM due to the shadow pages
being alive when their associated cache is destroyed.
==========================================================================
BUG kvm_mmu_page_header: Objects remaining in kvm_mmu_page_header on ...
--------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: Slab 0x00000000fc436387 objects=26 used=23 fp=0x00000000d023caee ...
CPU: 6 PID: 4315 Comm: rmmod Tainted: G B 5.1.0-rc2+ #19
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
Call Trace:
dump_stack+0x46/0x5b
slab_err+0xad/0xd0
? on_each_cpu_mask+0x3c/0x50
? ksm_migrate_page+0x60/0x60
? on_each_cpu_cond_mask+0x7c/0xa0
? __kmalloc+0x1ca/0x1e0
__kmem_cache_shutdown+0x13a/0x310
shutdown_cache+0xf/0x130
kmem_cache_destroy+0x1d5/0x200
kvm_mmu_module_exit+0xa/0x30 [kvm]
kvm_arch_exit+0x45/0x60 [kvm]
kvm_exit+0x6f/0x80 [kvm]
vmx_exit+0x1a/0x50 [kvm_intel]
__x64_sys_delete_module+0x153/0x1f0
? exit_to_usermode_loop+0x88/0xc0
do_syscall_64+0x4f/0x100
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Fixes: 3ea9d688660b0 ("KVM: x86/mmu: Split remote_flush+zap case out of kvm_mmu_flush_or_zap()") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
s390/pkey: add one more argument space for debug feature entry
The debug feature entries have been used with up to 5 arguents
(including the pointer to the format string) but there was only
space reserved for 4 arguemnts. So now the registration does
reserve space for 5 times a long value.
This fixes a sometime appearing weired value as the last
value of an debug feature entry like this:
... pkey_sec2protkey zcrypt_send_cprb (cardnr=10 domain=12)
failed with errno -2143346254
Signed-off-by: Harald Freudenberger <freude@linux.ibm.com> Reported-by: Christian Rund <Christian.Rund@de.ibm.com> Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
David Francis [Fri, 29 Mar 2019 17:23:15 +0000 (13:23 -0400)]
drm/amd/display: If one stream full updates, full update all planes
[Why]
On some compositors, with two monitors attached, VT terminal
switch can cause a graphical issue by the following means:
There are two streams, one for each monitor. Each stream has one
plane
current state:
M1:S1->P1
M2:S2->P2
The user calls for a terminal switch and a commit is made to
change both planes to linear swizzle mode. In atomic check,
a new dc_state is constructed with new planes on each stream
new state:
M1:S1->P3
M2:S2->P4
In commit tail, each stream is committed, one at a time. The first
stream (S1) updates properly, triggerring a full update and replacing
the state
current state:
M1:S1->P3
M2:S2->P4
The update for S2 comes in, but dc detects that there is no difference
between the stream and plane in the new and current states, and so
triggers a fast update. The fast update does not program swizzle,
so the second monitor is corrupted
[How]
Add a flag to dc_plane_state that forces full updates
When a stream undergoes a full update, set this flag on all changed
planes, then clear it on the current stream
Subsequent streams will get full updates as a result
Signed-off-by: David Francis <David.Francis@amd.com> Signed-off-by: Nicholas Kazlauskas <nicholas.kazlauskas@amd.com> Reviewed-by: Roman Li <Roman.Li@amd.com> Acked-by: Bhawanpreet Lakha <Bhawanpreet Lakha@amd.com> Acked-by: Nicholas Kazlauskas <Nicholas.Kazlauskas@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Jann Horn reported that he can overflow the page ref count with
sufficient memory (and a filesystem that is intentionally extremely
slow).
Admittedly it's not exactly easy. To have more than four billion
references to a page requires a minimum of 32GB of kernel memory just
for the pointers to the pages, much less any metadata to keep track of
those pointers. Jann needed a total of 140GB of memory and a specially
crafted filesystem that leaves all reads pending (in order to not ever
free the page references and just keep adding more).
Still, we have a fairly straightforward way to limit the two obvious
user-controllable sources of page references: direct-IO like page
references gotten through get_user_pages(), and the splice pipe page
duplication. So let's just do that.
* branch page-refs:
fs: prevent page refcount overflow in pipe_buf_get
mm: prevent get_user_pages() from overflowing page refcount
mm: add 'try_get_page()' helper function
mm: make page ref count overflow check tighter and more explicit
I've cc'ed some of the checksum fixes to Eric Dumazet and i would like to get
his feedback before you pull.
For -stable v4.19
('net/mlx5: FPGA, tls, idr remove on flow delete')
('net/mlx5: FPGA, tls, hold rcu read lock a bit longer')
For -stable v4.20
('net/mlx5e: Rx, Check ip headers sanity')
('Revert "net/mlx5e: Enable reporting checksum unnecessary also for L3 packets"')
('net/mlx5e: Rx, Fixup skb checksum for packets with tail padding')
For -stable v5.0
('net/mlx5e: Switch to Toeplitz RSS hash by default')
('net/mlx5e: Protect against non-uplink representor for encap')
('net/mlx5e: XDP, Avoid checksum complete when XDP prog is loaded')
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: f2b95d9b2793 ("rtnetlink: stats: validate attributes in get as well as dumps") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Sun, 14 Apr 2019 20:59:49 +0000 (13:59 -0700)]
Merge branch 'qed-doorbell-overflow-recovery'
Denis Bolotin says:
====================
qed: Fix the Doorbell Overflow Recovery mechanism
This patch series fixes and improves the doorbell recovery mechanism.
The main goals of this series are to fix missing attentions from the
doorbells block (DORQ) or not handling them properly, and execute the
recovery from periodic handler instead of the attention handler.
Please consider applying the series to net.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Separate the overflow handling from the hardware interrupt status analysis.
The interrupt status is a single register and is common for all PFs. The
first PF reading the register is not necessarily the one who overflowed.
All PFs must check their overflow status on every attention.
In this change we clear the sticky indication in the attention handler to
allow doorbells to be processed again as soon as possible, but running
the doorbell recovery is scheduled for the periodic handler to reduce the
time spent in the attention handler.
Checking the need for DORQ flush was changed to "db_bar_no_edpm" because
qed_edpm_enabled()'s result could change dynamically and might have
prevented a needed flush.
Signed-off-by: Denis Bolotin <dbolotin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
When the DORQ (doorbell block) is overflowed, all PFs get attentions at the
same time. If one PF finished handling the attention before another PF even
started, the second PF might miss the DORQ's attention bit and not handle
the attention at all.
If the DORQ attention is missed and the issue is not resolved, another
attention will not be sent, therefore each attention is treated as a
potential DORQ attention.
As a result, the attention callback is called more frequently so the debug
print was moved to reduce its quantity.
The number of periodic doorbell recovery handler schedules was reduced
because it was the previous way to mitigating the missed attention issue.
Signed-off-by: Denis Bolotin <dbolotin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fix the condition which verifies that doorbell address is inside the
doorbell bar by checking that the end of the address is within range
as well.
Signed-off-by: Denis Bolotin <dbolotin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
DB_REC_DRY_RUN (running doorbell recovery without sending doorbells) is
never used. DB_REC_ONCE (send a single doorbell from the doorbell recovery)
is not needed anymore because by running the periodic handler we make sure
we check the overflow status later instead.
This patch is needed because in the next patches, the only doorbell
recovery type being used is DB_REC_REAL_DEAL, and the fixes are much
cleaner without this enum.
Signed-off-by: Denis Bolotin <dbolotin@marvell.com> Signed-off-by: Michal Kalderon <mkalderon@marvell.com> Signed-off-by: Ariel Elior <aelior@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Fixes: 2c72e2490960 ("ipv4: recompile ip options in ipv4_link_failure") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
Matthew Wilcox [Fri, 5 Apr 2019 21:02:10 +0000 (14:02 -0700)]
fs: prevent page refcount overflow in pipe_buf_get
Change pipe_buf_get() to return a bool indicating whether it succeeded
in raising the refcount of the page (if the thing in the pipe is a page).
This removes another mechanism for overflowing the page refcount. All
callers converted to handle a failure.
mm: prevent get_user_pages() from overflowing page refcount
If the page refcount wraps around past zero, it will be freed while
there are still four billion references to it. One of the possible
avenues for an attacker to try to make this happen is by doing direct IO
on a page multiple times. This patch makes get_user_pages() refuse to
take a new page reference if there are already more than two billion
references to the page.
This is the same as the traditional 'get_page()' function, but instead
of unconditionally incrementing the reference count of the page, it only
does so if the count was "safe". It returns whether the reference count
was incremented (and is marked __must_check, since the caller obviously
has to be aware of it).
Also like 'get_page()', you can't use this function unless you already
had a reference to the page. The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.
The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.
NOTE! The count access for the "safety" check is inherently racy, but
that doesn't matter since the buffer we use is basically half the range
of the reference count (ie we look at the sign of the count).
mm: make page ref count overflow check tighter and more explicit
We have a VM_BUG_ON() to check that the page reference count doesn't
underflow (or get close to overflow) by checking the sign of the count.
That's all fine, but we actually want to allow people to use a "get page
ref unless it's already very high" helper function, and we want that one
to use the sign of the page ref (without triggering this VM_BUG_ON).
Change the VM_BUG_ON to only check for small underflows (or _very_ close
to overflowing), and ignore overflows which have strayed into negative
territory.
Merge tag 'for-linus-20190412' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"Set of fixes that should go into this round. This pull is larger than
I'd like at this time, but there's really no specific reason for that.
Some are fixes for issues that went into this merge window, others are
not. Anyway, this contains:
- Hardware queue limiting for virtio-blk/scsi (Dongli)
- Multi-page bvec fixes for lightnvm pblk
- Multi-bio dio error fix (Jason)
- Remove the cache hint from the io_uring tool side, since we didn't
move forward with that (me)
- Make io_uring SETUP_SQPOLL root restricted (me)
- Fix leak of page in error handling for pc requests (Jérôme)
- Fix BFQ regression introduced in this merge window (Paolo)
- Fix break logic for bio segment iteration (Ming)
- Fix NVMe cancel request error handling (Ming)
- NVMe pull request with two fixes (Christoph):
- fix the initial CSN for nvme-fc (James)
- handle log page offsets properly in the target (Keith)"
* tag 'for-linus-20190412' of git://git.kernel.dk/linux-block:
block: fix the return errno for direct IO
nvmet: fix discover log page when offsets are used
nvme-fc: correct csn initialization and increments on error
block: do not leak memory in bio_copy_user_iov()
lightnvm: pblk: fix crash in pblk_end_partial_read due to multipage bvecs
nvme: cancel request synchronously
blk-mq: introduce blk_mq_complete_request_sync()
scsi: virtio_scsi: limit number of hw queues by nr_cpu_ids
virtio-blk: limit number of hw queues by nr_cpu_ids
block, bfq: fix use after free in bfq_bfqq_expire
io_uring: restrict IORING_SETUP_SQPOLL to root
tools/io_uring: remove IOCQE_FLAG_CACHEHIT
block: don't use for-inside-for in bio_for_each_segment_all
Merge tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
Pull NFS client bugfixes from Trond Myklebust:
"Highlights include:
Stable fix:
- Fix a deadlock in close() due to incorrect draining of RDMA queues
Bugfixes:
- Revert "SUNRPC: Micro-optimise when the task is known not to be
sleeping" as it is causing stack overflows
- Fix a regression where NFSv4 getacl and fs_locations stopped
working
- Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
- Fix xfstests failures due to incorrect copy_file_range() return
values"
* tag 'nfs-for-5.1-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
Revert "SUNRPC: Micro-optimise when the task is known not to be sleeping"
NFSv4.1 fix incorrect return value in copy_file_range
xprtrdma: Fix helper that drains the transport
NFS: Fix handling of reply page vector
NFS: Forbid setting AF_INET6 to "struct sockaddr_in"->sin_family.
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Here's more than a handful of clk driver fixes for changes that came
in during the merge window:
- Fix the AT91 sama5d2 programmable clk prescaler formula
- A bunch of Amlogic meson clk driver fixes for the VPU clks
- A DMI quirk for Intel's Bay Trail SoC's driver to properly mark pmc
clks as critical only when really needed
- Stop overwriting CLK_SET_RATE_PARENT flag in mediatek's clk gate
implementation
- Use the right structure to test for a frequency table in i.MX's
PLL_1416x driver"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: imx: Fix PLL_1416X not rounding rates
clk: mediatek: fix clk-gate flag setting
platform/x86: pmc_atom: Drop __initconst on dmi table
clk: x86: Add system specific quirk to mark clocks as critical
clk: meson: vid-pll-div: remove warning and return 0 on invalid config
clk: meson: pll: fix rounding and setting a rate that matches precisely
clk: meson-g12a: fix VPU clock parents
clk: meson: g12a: fix VPU clock muxes mask
clk: meson-gxbb: round the vdec dividers to closest
clk: at91: fix programmable clock for sama5d2
Merge tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci
Pull PCI fixes from Bjorn Helgaas:
- Add a DMA alias quirk for another Marvell SATA device (Andre
Przywara)
- Fix a pciehp regression that broke safe removal of devices (Sergey
Miroshnichenko)
* tag 'pci-v5.1-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci:
PCI: pciehp: Ignore Link State Changes after powering off a slot
PCI: Add function 1 DMA alias quirk for Marvell 9170 SATA controller
Merge tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
"A minor build fix for 64-bit FLATMEM configs.
A fix for a boot failure on 32-bit powermacs.
My commit to fix CLOCK_MONOTONIC across Y2038 broke the 32-bit VDSO on
64-bit kernels, ie. compat mode, which is only used on big endian.
The rewrite of the SLB code we merged in 4.20 missed the fact that the
0x380 exception is also used with the Radix MMU to report out of range
accesses. This could lead to an oops if userspace tried to read from
addresses outside the user or kernel range.
Thanks to: Aneesh Kumar K.V, Christophe Leroy, Larry Finger, Nicholas
Piggin"
* tag 'powerpc-5.1-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Define MAX_PHYSMEM_BITS for all 64-bit configs
powerpc/64s/radix: Fix radix segment exception handling
powerpc/vdso32: fix CLOCK_MONOTONIC on PPC64
powerpc/32: Fix early boot failure with RTAS built-in
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Will Deacon:
"The main thing is a fix to our FUTEX_WAKE_OP implementation which was
unbelievably broken, but did actually work for the one scenario that
GLIBC used to use.
- Fix terminally broken implementation of FUTEX_WAKE_OP atomics"
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value
arm64: backtrace: Don't bother trying to unwind the userspace stack
arm64/ftrace: fix inadvertent BUG() in trampoline check
David Howells [Sat, 13 Apr 2019 07:37:37 +0000 (08:37 +0100)]
afs: Fix in-progess ops to ignore server-level callback invalidation
The in-kernel afs filesystem client counts the number of server-level
callback invalidation events (CB.InitCallBackState* RPC operations) that it
receives from the server. This is stored in cb_s_break in various
structures, including afs_server and afs_vnode.
If an inode is examined by afs_validate(), say, the afs_server copy is
compared, along with other break counters, to those in afs_vnode, and if
one or more of the counters do not match, it is considered that the
server's callback promise is broken. At points where this happens,
AFS_VNODE_CB_PROMISED is cleared to indicate that the status must be
refetched from the server.
afs_validate() issues an FS.FetchStatus operation to get updated metadata -
and based on the updated data_version may invalidate the pagecache too.
However, the break counters are also used to determine whether to note a
new callback in the vnode (which would set the AFS_VNODE_CB_PROMISED flag)
and whether to cache the permit data included in the YFSFetchStatus record
by the server.
The problem comes when the server sends us a CB.InitCallBackState op. The
first such instance doesn't cause cb_s_break to be incremented, but rather
causes AFS_SERVER_FL_NEW to be cleared - but thereafter, say some hours
after last use and all the volumes have been automatically unmounted and
the server has forgotten about the client[*], this *will* likely cause an
increment.
[*] There are other circumstances too, such as the server restarting or
needing to make space in its callback table.
Note that the server won't send us a CB.InitCallBackState op until we talk
to it again.
So what happens is:
(1) A mount for a new volume is attempted, a inode is created for the root
vnode and vnode->cb_s_break and AFS_VNODE_CB_PROMISED aren't set
immediately, as we don't have a nominated server to talk to yet - and
we may iterate through a few to find one.
(2) Before the operation happens, afs_fetch_status(), say, notes in the
cursor (fc.cb_break) the break counter sum from the vnode, volume and
server counters, but the server->cb_s_break is currently 0.
(3) We send FS.FetchStatus to the server. The server sends us back
CB.InitCallBackState. We increment server->cb_s_break.
(4) Our FS.FetchStatus completes. The reply includes a callback record.
(5) xdr_decode_AFSCallBack()/xdr_decode_YFSCallBack() check to see whether
the callback promise was broken by checking the break counter sum from
step (2) against the current sum.
This fails because of step (3), so we don't set the callback record
and, importantly, don't set AFS_VNODE_CB_PROMISED on the vnode.
This does not preclude the syscall from progressing, and we don't loop here
rechecking the status, but rather assume it's good enough for one round
only and will need to be rechecked next time.
(6) afs_validate() it triggered on the vnode, probably called from
d_revalidate() checking the parent directory.
(7) afs_validate() notes that AFS_VNODE_CB_PROMISED isn't set, so doesn't
update vnode->cb_s_break and assumes the vnode to be invalid.
(8) afs_validate() needs to calls afs_fetch_status(). Go back to step (2)
and repeat, every time the vnode is validated.
This primarily affects volume root dir vnodes. Everything subsequent to
those inherit an already incremented cb_s_break upon mounting.
The issue is that we assume that the callback record and the cached permit
information in a reply from the server can't be trusted after getting a
server break - but this is wrong since the server makes sure things are
done in the right order, holding up our ops if necessary[*].
[*] There is an extremely unlikely scenario where a reply from before the
CB.InitCallBackState could get its delivery deferred till after - at
which point we think we have a promise when we don't. This, however,
requires unlucky mass packet loss to one call.
AFS_SERVER_FL_NEW tries to paper over the cracks for the initial mount from
a server we've never contacted before, but this should be unnecessary.
It's also further insulated from the problem on an initial mount by
querying the server first with FS.GetCapabilities, which triggers the
CB.InitCallBackState.
Fix this by
(1) Remove AFS_SERVER_FL_NEW.
(2) In afs_calc_vnode_cb_break(), don't include cb_s_break in the
calculation.
(3) In afs_cb_is_broken(), don't include cb_s_break in the check.
Signed-off-by: David Howells <dhowells@redhat.com>
Marc Dionne [Sat, 13 Apr 2019 07:37:37 +0000 (08:37 +0100)]
afs: Unlock pages for __pagevec_release()
__pagevec_release() complains loudly if any page in the vector is still
locked. The pages need to be locked for generic_error_remove_page(), but
that function doesn't actually unlock them.
Unlock the pages afterwards.
Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Jonathan Billings <jsbillin@umich.edu>
David Howells [Sat, 13 Apr 2019 07:37:37 +0000 (08:37 +0100)]
afs: Differentiate abort due to unmarshalling from other errors
Differentiate an abort due to an unmarshalling error from an abort due to
other errors, such as ENETUNREACH. It doesn't make sense to set abort code
RXGEN_*_UNMARSHAL in such a case, so use RX_USER_ABORT instead.
Signed-off-by: David Howells <dhowells@redhat.com>