Yang Yang [Tue, 22 Mar 2022 21:46:33 +0000 (14:46 -0700)]
mm/vmstat: add event for ksm swapping in copy
When a fault swaps in what used to be a KSM page and that page had been
swapped in before, the system has to make a copy, and leaves remerging the
pages to a later pass of ksmd.
That is not good for performance, so we had better reduce this kind of copy.
There are some ways to reduce it, for example lessen swappiness or
madvise(, , MADV_MERGEABLE) range. So add this event to support doing
this tuning, just like the patch "mm, THP, swap: add THP swapping out
fallback counting".
Link: https://lkml.kernel.org/r/20220113023839.758845-1-yang.yang29@zte.com.cn Signed-off-by: Yang Yang <yang.yang29@zte.com.cn> Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Hugh Dickins <hughd@google.com> Cc: Yang Shi <yang.shi@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Saravanan D <saravanand@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Johannes Weiner [Tue, 22 Mar 2022 21:46:30 +0000 (14:46 -0700)]
mm: page_io: fix psi memory pressure error on cold swapins
Once upon a time, all swapins counted toward memory pressure[1]. Then
Joonsoo introduced workingset detection for anonymous pages and we gained
the ability to distinguish hot from cold swapins[2][3]. But we failed to
update swap_readpage() accordingly, and now we account partial memory
pressure in the swapin path of cold memory.
Not for all situations - which adds more inconsistency: paths using the
conventional submit_bio() and lock_page() route will not see much pressure
- unless storage itself is heavily congested and the bio submissions
stall. ZRAM and ZSWAP do most of the work directly from swap_readpage()
and will see all swapins reflected as pressure.
IOW, a workload doing cold swapins could see little to no pressure
reported with on-disk swap, but potentially high pressure with a zram or
zswap backend. That confuses any psi-based health monitoring, load
shedding, proactive reclaim, or userspace OOM killing schemes that might
be in place for the workload.
Restore consistency by making all swapin stall accounting conditional on
the page actually being part of the workingset.
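A minimal model of the accounting rule described above, written as a user-space sketch rather than kernel code. The function name mirrors swap_readpage() only for readability; `read_page`, `is_workingset` and `stall_log` are hypothetical stand-ins for the backend read, the page's workingset state and the pressure bookkeeping.

```python
import time


def swap_readpage(read_page, is_workingset, stall_log):
    """Model of swapin stall accounting; every name here is illustrative.

    read_page:     callable performing the backend read (disk, zram, zswap).
    is_workingset: True when the page was recently part of the workingset.
    stall_log:     list collecting durations that count as memory pressure.
    """
    start = time.monotonic()
    read_page()
    elapsed = time.monotonic() - start
    # Only refaults of workingset (hot) pages are recorded as a stall,
    # regardless of which swap backend served the read.
    if is_workingset:
        stall_log.append(elapsed)
    return elapsed
```

With this rule a cold swapin contributes nothing to the log whether the read was served synchronously or through a block device, which is the consistency the text asks for.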
[1] commit 9c75b570ac89 ("mm/page_io.c: annotate refault stalls from swap_readpage")
[2] commit ddbc26d24535 ("mm/swap: implement workingset detection for anonymous LRU")
[3] commit 6e10a5dbcb2f ("mm/swap: don't SetPageWorkingset unconditionally during swapin")
Link: https://lkml.kernel.org/r/20220214214921.419687-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: CGEL <cgel.zte@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Tue, 22 Mar 2022 21:46:27 +0000 (14:46 -0700)]
memory tiering: skip to scan fast memory
If NUMA balancing isn't used to optimize page placement among
sockets but only among memory types, the hot pages in the fast memory
node can't be migrated (promoted) anywhere. So it's unnecessary
to scan the pages in the fast memory node by changing their PTE/PMD
mapping to PROT_NONE, and the resulting page faults can be avoided too.
In the test, if only the memory tiering NUMA balancing mode is enabled,
the number of NUMA balancing hint faults for the DRAM node is
reduced to almost 0 with the patch, while the benchmark score doesn't
change visibly.
Link: https://lkml.kernel.org/r/20220221084529.1052339-4-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Suggested-by: Dave Hansen <dave.hansen@linux.intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Tue, 22 Mar 2022 21:46:23 +0000 (14:46 -0700)]
NUMA balancing: optimize page placement for memory tiering system
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering system,
because the performance of the different types of memory is usually
different.
In such a system, because memory access patterns change over time,
some pages in the slow memory may become hot globally. So in this
patch, the NUMA balancing mechanism is enhanced to dynamically optimize
the page placement among the different memory types according to how
hot or cold the pages are.
In a typical memory tiering system, there are CPUs, fast memory and slow
memory in each physical NUMA node. The CPUs and the fast memory will be
put in one logical node (called fast memory node), while the slow memory
will be put in another (faked) logical node (called slow memory node).
That is, the fast memory is regarded as local while the slow memory is
regarded as remote. So it's possible for the recently accessed pages in
the slow memory node to be promoted to the fast memory node via the
existing NUMA balancing mechanism.
The original NUMA balancing mechanism will stop migrating pages if the
free memory of the target node falls below the high watermark. This
is a reasonable policy if there's only one memory type. But it makes
the original NUMA balancing mechanism almost useless for optimizing
page placement among different memory types. Details are as follows.
It's the common case that the working-set size of the workload is
larger than the size of the fast memory nodes. Otherwise, it's
unnecessary to use the slow memory at all. So there are almost never
enough free pages in the fast memory nodes, and the globally hot
pages in the slow memory node cannot be promoted to the fast memory
node. To solve the issue, we have 2 choices as follows,
a. Ignore the free pages watermark checking when promoting hot pages
from the slow memory node to the fast memory node. This will
create some memory pressure in the fast memory node, thus trigger
the memory reclaiming. So that, the cold pages in the fast memory
node will be demoted to the slow memory node.
b. Define a new watermark called wmark_promo which is higher than
wmark_high, and have kswapd reclaiming pages until free pages reach
such watermark. The scenario is as follows: when we want to promote
hot-pages from a slow memory to a fast memory, but fast memory's free
pages would go lower than high watermark with such promotion, we wake
up kswapd with wmark_promo watermark in order to demote cold pages and
free us up some space. So, next time we want to promote hot-pages we
might have a chance of doing so.
Choice "a" may create high memory pressure in the fast memory node.
If the memory pressure of the workload is high, the memory pressure
may become so high that the memory allocation latency of the workload
is affected, e.g. direct reclaim may be triggered.
Choice "b" works much better in this respect. If the memory
pressure of the workload is high, hot page promotion will stop
earlier because its allocation watermark is higher than that of
normal memory allocation. So in this patch, choice "b" is implemented.
A new zone watermark (WMARK_PROMO) is added, which is larger than the
high watermark and can be controlled via watermark_scale_factor.
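The decision described for choice "b" can be sketched as follows. This is a simplified user-space model, not the kernel implementation: `try_promote` and its parameters are hypothetical names, and the exact comparison against the high watermark is an assumption.

```python
def try_promote(free_pages, nr_pages, wmark_high, wmark_promo):
    """Decide whether to promote nr_pages into a fast node (illustrative).

    Returns (promote, kswapd_target): promote says whether the migration
    goes ahead; kswapd_target is the free-page level that background
    reclaim should restore, or None when no reclaim is requested.
    """
    if wmark_promo <= wmark_high:
        raise ValueError("wmark_promo must be above wmark_high")
    if free_pages - nr_pages >= wmark_high:
        # Enough headroom: promotion keeps the node at or above high.
        return True, None
    # Not enough room: skip this promotion and ask kswapd to demote cold
    # pages until free memory reaches the promotion watermark.
    return False, wmark_promo
```

Because the skipped promotion only wakes background reclaim, the workload's own allocations are not pushed into direct reclaim, and a later promotion attempt may find enough free pages.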
In addition to the original page placement optimization among sockets,
the NUMA balancing mechanism is extended to optimize page
placement according to hot/cold among different memory types. So the
sysctl user space interface (numa_balancing) is extended in a backward
compatible way as follows, so that users can enable/disable this
functionality individually.
The sysctl is converted from a Boolean value to a bit field. The
definition of the flags is,
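As a rough illustration of decoding such a bit field, the sketch below assumes the flag names and values from the kernel's sysctl documentation as I understand it (0 disabled, bit 0 normal balancing, bit 1 memory tiering). These names and values are not spelled out in the text above, and `decode_numa_balancing` is a hypothetical helper.

```python
# Assumed flag values; treat them as illustrative, not authoritative.
NUMA_BALANCING_DISABLED = 0x0
NUMA_BALANCING_NORMAL = 0x1          # assumed: placement among sockets
NUMA_BALANCING_MEMORY_TIERING = 0x2  # assumed: placement among memory types


def decode_numa_balancing(value):
    """Return the list of modes enabled in a numa_balancing value."""
    modes = []
    if value & NUMA_BALANCING_NORMAL:
        modes.append("normal")
    if value & NUMA_BALANCING_MEMORY_TIERING:
        modes.append("memory_tiering")
    return modes or ["disabled"]
```

Under these assumptions the old Boolean values 0 and 1 keep their meaning, which is what makes the extension backward compatible.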
We have tested the patch with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent
Memory Model. The test results show that the pmbench score can
improve by up to 95.9%.
Thanks to Andrew Morton for helping fix the document format error.
Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Feng Tang <feng.tang@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Huang Ying [Tue, 22 Mar 2022 21:46:20 +0000 (14:46 -0700)]
NUMA Balancing: add page promotion counter
Patch series "NUMA balancing: optimize memory placement for memory tiering system", v13
With the advent of various new memory types, some machines will have
multiple types of memory, e.g. DRAM and PMEM (persistent memory). The
memory subsystem of these machines can be called a memory tiering system,
because the performance of the different types of memory is different.
After commit b62f840a5a6f ("device-dax: "Hotplug" persistent memory for
use like normal RAM"), the PMEM could be used as the cost-effective
volatile memory in separate NUMA nodes. In a typical memory tiering
system, there are CPUs, DRAM and PMEM in each physical NUMA node. The
CPUs and the DRAM will be put in one logical node, while the PMEM will
be put in another (faked) logical node.
To optimize overall system performance, the hot pages should be
placed in the DRAM node. To do that, we need to identify the hot pages in
the PMEM node and migrate them to the DRAM node via NUMA migration.
In the original NUMA balancing, there is already a set of
mechanisms to identify the pages recently accessed by the CPUs in a node
and migrate the pages to that node. So we can reuse these mechanisms to
optimize the page placement in the memory tiering system. This is
implemented in this patchset.
On the other hand, the cold pages should be placed in the PMEM node. So, we
also need to identify the cold pages in the DRAM node and migrate them
to the PMEM node.
In commit a1ca6f335d1c ("mm/migrate: demote pages during reclaim"), a
mechanism to demote the cold DRAM pages to PMEM node under memory
pressure is implemented. Based on that, the cold DRAM pages can be
demoted to PMEM node proactively to free some memory space on DRAM node
to accommodate the promoted hot PMEM pages. This is implemented in this
patchset too.
We have tested the solution with the pmbench memory accessing benchmark
with the 80:20 read/write ratio and the Gauss access address
distribution on a 2 socket Intel server with Optane DC Persistent Memory
Model. The test results show that the pmbench score can improve by up to
95.9%.
This patch (of 3):
In a system with multiple memory types, e.g. DRAM and PMEM, the CPU
and DRAM in one socket will be put in one NUMA node as before, while
the PMEM will be put in another NUMA node as described in the
description of the commit b62f840a5a6f ("device-dax: "Hotplug"
persistent memory for use like normal RAM"). So, the NUMA balancing
mechanism will identify all PMEM accesses as remote access and try to
promote the PMEM pages to DRAM.
To distinguish the number of inter-type promoted pages from that of
inter-socket migrated pages, a new vmstat count is added. The
counter is per-node (counted in the target node), so it can be used to
identify promotion imbalance among the NUMA nodes.
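A small sketch of how a per-node counter like this could be compared across nodes from user space. The counter name `pgpromote_success` and the idea of feeding in the text of each node's vmstat file (for example /sys/devices/system/node/node<N>/vmstat) are assumptions for illustration; the commit message above does not name them.

```python
def parse_vmstat(text):
    """Parse 'name value' lines, as found in vmstat files, into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def promotion_by_node(node_vmstats, counter="pgpromote_success"):
    """Map node id -> promotion count from each node's vmstat text.

    node_vmstats: dict of node id -> file contents (hypothetical input).
    A node whose text lacks the counter is reported as 0.
    """
    return {
        node: parse_vmstat(text).get(counter, 0)
        for node, text in node_vmstats.items()
    }
```

Comparing the resulting values between fast memory nodes gives a quick view of whether promotions land evenly or pile up on one node.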
Link: https://lkml.kernel.org/r/20220301085329.3210428-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220221084529.1052339-1-ying.huang@intel.com Link: https://lkml.kernel.org/r/20220221084529.1052339-2-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Wei Xu <weixugc@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Feng Tang <feng.tang@intel.com> Cc: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hari Bathini [Tue, 22 Mar 2022 21:46:17 +0000 (14:46 -0700)]
powerpc/fadump: opt out from freeing pages on cma activation failure
With commit 1033151e9b41 ("powerpc/fadump: Reservationless firmware
assisted dump"), Linux kernel's Contiguous Memory Allocator (CMA) based
reservation was introduced in fadump. That change was aimed at using CMA
to let applications utilize the memory reserved for fadump while blocking
it from being used for kernel pages. The assumption was, even if CMA
activation fails for whatever reason, the memory still remains reserved to
avoid it from being used for kernel pages. But commit 023addc76f0a
("mm/cma: expose all pages to the buddy if activation of an area fails")
breaks this assumption as it started exposing all pages to buddy allocator
on CMA activation failure. It led to warning messages like below while
running crash-utility on vmcore of a kernel having above two commits:
Hari Bathini [Tue, 22 Mar 2022 21:46:14 +0000 (14:46 -0700)]
mm/cma: provide option to opt out from exposing pages on activation failure
Patch series "powerpc/fadump: handle CMA activation failure appropriately", v3.
Commit 023addc76f0a ("mm/cma: expose all pages to the buddy if
activation of an area fails") started exposing all pages to buddy
allocator on CMA activation failure. But there can be CMA users that
want to handle the reserved memory differently on CMA allocation
failure.
Provide an option to opt out from exposing pages to buddy for such
cases.
Link: https://lkml.kernel.org/r/20220117075246.36072-1-hbathini@linux.ibm.com Link: https://lkml.kernel.org/r/20220117075246.36072-2-hbathini@linux.ibm.com Signed-off-by: Hari Bathini <hbathini@linux.ibm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Mahesh Salgaonkar <mahesh@linux.ibm.com> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 22 Mar 2022 21:46:11 +0000 (14:46 -0700)]
mm/thp: refix __split_huge_pmd_locked() for migration PMD
Migration entries do not contribute to a page's reference count: move
__split_huge_pmd_locked()'s page_ref_add() into pmd_migration's else
block (along with the page_count() check - a page is quite likely to
have reference count frozen to 0 when a migration entry is found).
This will fix a very rare anonymous memory leak, after a
split_huge_pmd() raced with an anon split_huge_page() or an anon THP
migrate_pages(): since the wrongly raised refcount stopped the page
(perhaps small, perhaps huge, depending on when the race hit) from ever
being freed.
At first I thought there were worse risks, from prematurely unfreezing a
frozen page: but now think that would only affect page cache pages,
which do not come this way (except for anonymous pages in swap cache,
perhaps).
Link: https://lkml.kernel.org/r/84792468-f512-e48f-378c-e34c3641e97@google.com Fixes: c43d124cc97e ("mm/thp: fix __split_huge_pmd_locked() for migration PMD") Signed-off-by: Hugh Dickins <hughd@google.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
andrew.yang [Tue, 22 Mar 2022 21:46:08 +0000 (14:46 -0700)]
mm/migrate: fix race between lock page and clear PG_Isolated
When memory is tight, the system may start to compact memory to satisfy
large contiguous memory demands. If one process tries to lock a memory page
that is being locked and isolated for compaction, it may wait a long time
or even forever. This is because compaction performs a non-atomic
PG_Isolated clear while holding the page lock, which may overwrite the
PG_waiters bit set by a process that can't obtain the page lock and has
added itself to the waiting queue to wait for the lock to be released.
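The lost update can be replayed deterministically. The sketch below is a model only: the bit positions are hypothetical, and the atomic variant is included for contrast as the usual way such a race is avoided, not as a statement of what this patch changes.

```python
# Hypothetical bit positions, chosen only for this illustration.
PG_WAITERS = 1 << 0
PG_ISOLATED = 1 << 1


def racy_clear_isolated(flags, waiter_arrives):
    """Replay a non-atomic clear with a concurrent set in the middle."""
    # Step 1: compaction reads the flags word into a private snapshot.
    snapshot = flags
    # Step 2: a waiter sets PG_waiters on the shared word.
    shared = flags | PG_WAITERS if waiter_arrives else flags
    # Step 3: compaction writes back its stale snapshot minus PG_isolated,
    # discarding whatever step 2 stored in the shared word.
    shared = snapshot & ~PG_ISOLATED
    return shared


def atomic_clear_isolated(flags, waiter_arrives):
    """Same interleaving, but the clear is a single read-modify-write."""
    shared = flags | PG_WAITERS if waiter_arrives else flags
    shared &= ~PG_ISOLATED
    return shared
```

In the racy replay the returned word has PG_waiters cleared, so nobody wakes the sleeping process when the page is unlocked; in the atomic replay the bit is preserved.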
Huang Ying [Tue, 22 Mar 2022 21:46:05 +0000 (14:46 -0700)]
mm,migrate: fix establishing demotion target
In commit ad338e16d3db ("mm: migrate: support multiple target nodes
demotion"), after the first demotion target node is found, we will
continue to check the next candidate obtained via find_next_best_node().
This is to find all demotion target nodes with the same NUMA distance. But
one side effect of find_next_best_node() is that the candidate node
returned will be set in the "used" parameter, so even if the candidate node
fails the subsequent NUMA distance check, it will
not be used as a demotion target node for the following nodes. For example,
for system as follows,
when we establish the demotion target node for node 0, in the first round node
2 is added to the demotion target node set. Then in the second round,
node 3 is checked and fails because distance(0, 3) > distance(0, 2). But
node 3 is set in the "used" nodemask too. When we establish the demotion target
node for node 1, there is no available node. This is wrong: node 3 should
be set as the demotion target of node 1.
To fix this, if the candidate node fails the distance
check, it is cleared from the "used" nodemask, so that it can be used
for the following nodes.
The bug can be reproduced and fixed with this patch on a 2 socket server
machine with DRAM and PMEM.
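The behaviour before and after the fix can be modelled in a few lines. This is a simplified sketch: the Python find_next_best_node() below only picks the nearest unused node and ignores the other heuristics of the kernel function, and the distance table is a hypothetical one chosen to match the relations stated above (node 2 closest to node 0, node 3 closest to node 1).

```python
# Hypothetical distances: DIST[a][b] is the NUMA distance from a to b.
DIST = [
    [10, 21, 17, 28],
    [21, 10, 28, 17],
    [17, 28, 10, 28],
    [28, 17, 28, 10],
]


def find_next_best_node(node, used, dist):
    """Return the nearest node not in `used`, marking it used as a side effect."""
    candidates = [n for n in range(len(dist)) if n not in used]
    if not candidates:
        return None
    best = min(candidates, key=lambda n: (dist[node][n], n))
    used.add(best)
    return best


def demotion_targets(sources, dist, clear_on_reject):
    """Collect equal-distance demotion targets for each source node."""
    used = set(sources)
    targets = {}
    for node in sources:
        targets[node] = []
        best_distance = None
        while True:
            cand = find_next_best_node(node, used, dist)
            if cand is None:
                break
            d = dist[node][cand]
            if best_distance is None:
                best_distance = d
            elif d > best_distance:
                if clear_on_reject:
                    # The fix: give the rejected node back to later sources.
                    used.discard(cand)
                break
            targets[node].append(cand)
    return targets
```

With `clear_on_reject=False` node 1 ends up without a target because node 3 was consumed while handling node 0; with `clear_on_reject=True` node 1 gets node 3.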
Link: https://lkml.kernel.org/r/20220128055940.1792614-1-ying.huang@intel.com Fixes: ad338e16d3db ("mm: migrate: support multiple target nodes demotion") Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com> Cc: Xunlei Pang <xlpang@linux.alibaba.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 22 Mar 2022 21:45:59 +0000 (14:45 -0700)]
mempolicy: mbind_range() set_policy() after vma_merge()
v2.6.34 commit d48f50e31fa4 ("mm: fix mbind vma merge problem") introduced
vma_merge() to mbind_range(); but unlike madvise, mlock and mprotect, it
put a "continue" to next vma where its precedents go to update flags on
current vma before advancing: that left vma with the wrong setting in the
infamous vma_merge() case 8.
v3.10 commit 344ed2c40a43 ("mm: merging memory blocks resets mempolicy")
tried to fix that in vma_adjust(), without fully understanding the issue.
v3.11 commit e7b0d2d2cc4c ("mm: mempolicy: fix mbind_range() &&
vma_adjust() interaction") reverted that, and went about the fix in the
right way, but chose to optimize out an unnecessary mpol_dup() with a
prior mpol_equal() test. But on tmpfs, that also pessimized out the vital
call to its ->set_policy(), leaving the new mbind unenforced.
The user visible effect was that the pages got allocated on the local
node (happened to be 0), after the mbind() caller had specifically
asked for them to be allocated on node 1. There was not any page
migration involved in the case reported: the pages simply got allocated
on the wrong node.
Just delete that optimization now (though it could be made conditional on
vma not having a set_policy). Also remove the "next" variable: it turned
out to be blameless, but also pointless.
Baolin Wang [Tue, 22 Mar 2022 21:45:56 +0000 (14:45 -0700)]
mm: compaction: cleanup the compaction trace events
As Steven suggested [1], we should access the pointers from the trace
event to avoid dereferencing them to the tracepoint function when the
tracepoint is disabled.
mm: vmscan: fix documentation for page_check_references()
Commit b6df587f15fc ("mm/vmscan: protect the workingset on anonymous
LRU") requires looking twice at both mapped anon and file pages to
establish that they are used more than once before deciding between
reclaim and activation. Correct the documentation accordingly.
Link: https://lkml.kernel.org/r/1646925640-21324-1-git-send-email-quic_charante@quicinc.com Signed-off-by: Charan Teja Kalla <quic_charante@quicinc.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: workingset: replace IRQ-off check with a lockdep assert.
Commit 9df6368ade033 ("mm: workingset: add vmstat counter for shadow
nodes") introduced an IRQ-off check to ensure that a lock is held which
also disabled interrupts. This does not work the same way on PREEMPT_RT
because none of the locks, that are held, disable interrupts.
Replace this check with a lockdep assert which ensures that the lock is
held.
Link: https://lkml.kernel.org/r/20220301122143.1521823-3-bigeasy@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Marcelo Tosatti [Tue, 22 Mar 2022 21:45:47 +0000 (14:45 -0700)]
mm: lru_cache_disable: replace work queue synchronization with synchronize_rcu
On systems that run FIFO:1 applications that busy loop, any SCHED_OTHER
task that attempts to execute on such a CPU (such as work threads) will
not be scheduled, which leads to system hangs.
Commit 3b82afd8791892578 ("mm: disable LRU pagevec during the migration
temporarily") relies on queueing work items on all online CPUs to ensure
visibility of lru_disable_count.
To fix this, replace the usage of work items with synchronize_rcu,
which provides the same guarantees.
Readers of lru_disable_count are protected by either disabling
preemption or rcu_read_lock:
Since v5.1 kernel, synchronize_rcu() is guaranteed to wait on
preempt_disable() regions of code. So any CPU which sees
lru_disable_count = 0 will have exited the critical section when
synchronize_rcu() returns.
Link: https://lkml.kernel.org/r/Yin7hDxdt0s/x+fp@fuller.cnet Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Reviewed-by: Nicolas Saenz Julienne <nsaenzju@redhat.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Paul E. McKenney <paulmck@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since commit 7a950e9c2c53 ("mm/list_lru.c: fix list_lru_count_node() to
be race free"), we are tracking the total number of lru entries in a
list_lru_node in its nr_items field.
In the case of memcg_reparent_list_lru_node(), there is nothing to be
done if nr_items is 0. We don't even need to take the nlru->lock as no
new lru entry could be added by a racing list_lru_add() to the draining
src_idx memcg at this point.
On systems that serve a lot of containers, it is possible that there can
be thousands of list_lru's present due to the fact that each container
may mount its own container specific filesystems. As a typical
container uses only a few cpus, it is likely that only the list_lru_node
that contains those cpus will be utilized while the rest may be empty.
In other words, there can be a lot of list_lru_node with 0 nr_items.
By skipping a lock/unlock operation and loading a cacheline from
memcg_lrus, a sizeable number of cpu cycles can be saved. That can be
substantial if we are talking about thousands of list_lru_node's with 0
nr_items.
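The saving can be pictured with a toy model. `LruNode` and `reparent_all` are hypothetical names, and the lock is reduced to a counter so the skipped work is visible; the actual movement of entries is left out.

```python
class LruNode:
    """Toy stand-in for a per-node list with an item count."""

    def __init__(self, nr_items=0):
        self.nr_items = nr_items
        self.lock_taken = 0  # counts lock/unlock pairs, for illustration


def reparent_all(nodes):
    """Visit every node, skipping empty ones without taking their lock."""
    for node in nodes:
        if node.nr_items == 0:
            continue  # nothing to move, so the lock is never touched
        node.lock_taken += 1
        # Moving the entries to the parent's list would happen here.
    return sum(node.lock_taken for node in nodes)
```

For a thousand nodes of which only a handful hold entries, the number of lock round trips drops from a thousand to that handful.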
Link: https://lkml.kernel.org/r/20220309144000.1470138-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 22 Mar 2022 21:45:41 +0000 (14:45 -0700)]
mm: __isolate_lru_page_prepare() in isolate_migratepages_block()
__isolate_lru_page_prepare() conflates two unrelated functions, with the
flags to one disjoint from the flags to the other; and hides some of the
important checks outside of isolate_migratepages_block(), where the
sequence is better to be visible. It comes from the days of lumpy
reclaim, before compaction, when the combination made more sense.
Move what's needed by mm/compaction.c isolate_migratepages_block() inline
there, and what's needed by mm/vmscan.c isolate_lru_pages() inline there.
Shorten "isolate_mode" to "mode", so the sequence of conditions is easier
to read. Declare a "mapping" variable, to save one call to page_mapping()
(but not another: calling again after page is locked is necessary).
Simplify isolate_lru_pages() with a "move_to" list pointer.
Link: https://lkml.kernel.org/r/879d62a8-91cc-d3c6-fb3b-69768236df68@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Acked-by: David Rientjes <rientjes@google.com> Reviewed-by: Alex Shi <alexs@kernel.org> Cc: Alexander Duyck <alexander.duyck@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Hugh Dickins [Tue, 22 Mar 2022 21:45:38 +0000 (14:45 -0700)]
mm/fs: delete PF_SWAPWRITE
PF_SWAPWRITE has been redundant since v3.2 commit 0a365446b381 ("mm:
vmscan: do not writeback filesystem pages in direct reclaim").
Coincidentally, NeilBrown's current patch "remove inode_congested()"
deletes may_write_to_inode(), which appeared to be the one function which
took notice of PF_SWAPWRITE. But if you study the old logic, and the
conditions under which may_write_to_inode() was called, you discover that
flag and function have been pointless for a decade.
Link: https://lkml.kernel.org/r/75e80e7-742d-e3bd-531-614db8961e4@google.com Signed-off-by: Hugh Dickins <hughd@google.com> Cc: NeilBrown <neilb@suse.de> Cc: Jan Kara <jack@suse.de> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix following coccicheck warning:
tools/testing/selftests/vm/userfaultfd.c:556:23-24:
WARNING this kind of initialization is deprecated
`unsigned long page_nr = *(&page_nr)` has the same form as the
uninitialized_var() macro. Remove the redundant assignment. It has
been tested with gcc (Debian 8.3.0-6) 8.3.0.
The patch which removed uninitialized_var() is:
https://lore.kernel.org/all/20121028102007.GA7547@gmail.com/ And there are
very few "/* GCC */" comments in the Linux kernel code now.
Nadav Amit [Tue, 22 Mar 2022 21:45:32 +0000 (14:45 -0700)]
userfaultfd: provide unmasked address on page-fault
Userfaultfd is supposed to provide the full address (i.e., unmasked) of
the faulting access back to userspace. However, that has not been the
case for quite some time.
Even running "userfaultfd_demo" from the userfaultfd man page provides the
wrong output (and contradicts the man page). Notice that
"UFFD_EVENT_PAGEFAULT event" shows the masked address (7fc5e30b3000) and
not the first read address (0x7fc5e30b300f).
Address returned by mmap() = 0x7fc5e30b3000
fault_handler_thread():
poll() returns: nready = 1; POLLIN = 1; POLLERR = 0
UFFD_EVENT_PAGEFAULT event: flags = 0; address = 7fc5e30b3000
(uffdio_copy.copy returned 4096)
Read address 0x7fc5e30b300f in main(): A
Read address 0x7fc5e30b340f in main(): A
Read address 0x7fc5e30b380f in main(): A
Read address 0x7fc5e30b3c0f in main(): A
The exact address is useful for various reasons and specifically for
prefetching decisions. If it is known that the memory is populated by
certain objects whose size is not page-aligned, then based on the faulting
address, the uffd-monitor can decide whether to prefetch and prefault the
adjacent page.
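The difference between the masked and the exact address, and the kind of prefetch decision the exact address enables, can be sketched as follows. The 4096-byte page size matches the demo output above; `should_prefault_next` is a hypothetical helper, and it assumes the object begins at the faulting address.

```python
PAGE_SIZE = 4096  # assumed page size, as in the demo output


def page_base(addr, page_size=PAGE_SIZE):
    """Mask an address down to the start of its page."""
    return addr & ~(page_size - 1)


def should_prefault_next(fault_addr, obj_size, page_size=PAGE_SIZE):
    """True if an object starting at fault_addr spills into the next page."""
    last_byte = fault_addr + obj_size - 1
    return page_base(last_byte, page_size) != page_base(fault_addr, page_size)
```

A monitor that only receives the masked address always sees the start of the page, so for objects smaller than a page this check can never fire; with the exact address it can tell when the access sits close enough to the page end to justify populating the neighbour.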
This bug has been in the kernel for quite some time: since commit
ce0aac205752 ("mm: use vmf->address instead of of
vmf->virtual_address"), which dates back to 2016. A concern has been
raised that existing userspace applications might rely on the old/wrong
behavior in which the address is masked. Therefore, it was suggested to
provide the masked address unless the user explicitly asks for the exact
address.
Add a new userfaultfd feature UFFD_FEATURE_EXACT_ADDRESS to direct
userfaultfd to provide the exact address. Add a new "real_address" field
to vmf to hold the unmasked address. Provide the address to userspace
accordingly.
Initialize real_address in various code-paths to be consistent with
address, even when it is not used, to be on the safe side.
Link: https://lkml.kernel.org/r/20220218041003.3508-1-namit@vmware.com Signed-off-by: Nadav Amit <namit@vmware.com> Acked-by: Peter Xu <peterx@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Jan Kara <jack@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Tue, 22 Mar 2022 21:45:26 +0000 (14:45 -0700)]
mm/hugetlb.c: export PageHeadHuge()
Export PageHeadHuge() - it's used by folio_test_hugetlb() and thence by
such as folio_file_page() and folio_contains(). Matthew suggested I use
the first of those instead of doing the same calculation manually - but I
can't call it from a module.
Kirill suggested rearranging things to put it in a header, but that
introduces header dependencies because of where constants are defined.
[akpm@linux-foundation.org: s/EXPORT_SYMBOL/EXPORT_SYMBOL_GPL/, per Christoph]
Mike Kravetz [Tue, 22 Mar 2022 21:45:20 +0000 (14:45 -0700)]
hugetlb: clean up potential spectre issue warnings
Recently introduced code allows numa nodes to be specified on the kernel
command line for hugetlb allocations or CMA reservations. The node
values are user specified and used as indices into arrays. This
generated the following smatch warnings:
ARCH_WANT_GENERAL_HUGETLB config has duplicate definitions on platforms
that subscribe it. Instead make it a generic config option which can be
selected on applicable platforms when required.
Link: https://lkml.kernel.org/r/1643718465-4324-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:45:09 +0000 (14:45 -0700)]
selftests: vm: add a hugetlb test case
Since the head vmemmap page frame associated with each HugeTLB page is
reused, we should hide the PG_head flag of tail struct page from the
user. Add a test case to check whether it works properly. The test
steps are as follows.
1) alloc 2MB hugeTLB
2) get each page frame
3) apply those APIs in each page frame
4) Those APIs work completely the same as before.
Reading the flags of a page by /proc/kpageflags is done in
stable_page_flags(), which has invoked PageHead(), PageTail(),
PageCompound() and compound_head().
If those APIs work properly, the head page must have bits 15 and 17 set.
And tail pages must have bits 16 and 17 set but bit 15 unset. Those
flags are checked in check_page_flags().
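The bit checks described above can be sketched as two small predicates over a 64-bit flags word as read from /proc/kpageflags. This is only an illustration, not the selftest's actual check_page_flags(); the MODEL_* names and helper functions are this example's own labels for bits 15, 16 and 17.

```c
#include <stdbool.h>
#include <stdint.h>

/* Labels chosen for this example; only the bit numbers come from the text. */
#define MODEL_BIT_COMPOUND_HEAD 15
#define MODEL_BIT_COMPOUND_TAIL 16
#define MODEL_BIT_HUGE          17

static bool model_bit_set(uint64_t flags, unsigned int bit)
{
	return (flags >> bit) & 1;
}

/* Head page: bits 15 and 17 must be set. */
static bool model_is_expected_head(uint64_t flags)
{
	return model_bit_set(flags, MODEL_BIT_COMPOUND_HEAD) &&
	       model_bit_set(flags, MODEL_BIT_HUGE);
}

/* Tail page: bits 16 and 17 must be set, bit 15 must be clear. */
static bool model_is_expected_tail(uint64_t flags)
{
	return model_bit_set(flags, MODEL_BIT_COMPOUND_TAIL) &&
	       model_bit_set(flags, MODEL_BIT_HUGE) &&
	       !model_bit_set(flags, MODEL_BIT_COMPOUND_HEAD);
}
```

A tail page whose struct page leaked PG_head to userspace would fail model_is_expected_tail(), which is the regression the test case is meant to catch.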
Link: https://lkml.kernel.org/r/20211101031651.75851-5-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:45:06 +0000 (14:45 -0700)]
mm: sparsemem: use page table lock to protect kernel pmd operations
The init_mm.page_table_lock is used to protect kernel page tables, so we
can use it to serialize splitting vmemmap PMD mappings instead of the mmap
write lock, which can increase the concurrency of vmemmap_remap_free().
Actually, it increases the concurrency between allocations of HugeTLB
pages. But that is not the only benefit. There are a lot of users of the
mmap read lock of init_mm. The mmap write lock is held throughout
vmemmap_remap_free(), so removing the mmap write lock usage means it no
longer affects other users of the mmap read lock. It does not make
anything worse and is always a win.
Now the kernel page table walker does not hold the page_table_lock when
walking pmd entries. There may be a consistency issue with a pmd entry,
because a pmd entry might change from a huge pmd entry to a PTE page
table. There is only one user of the kernel page table walker, namely
ptdump. The ptdump already considers the consistency, using a local
variable to cache the value of the pmd entry. But we also need to update
->action to ACTION_CONTINUE to make sure the walker does not walk every
pte entry again when a concurrent thread has split the huge pmd.
Link: https://lkml.kernel.org/r/20211101031651.75851-4-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:45:03 +0000 (14:45 -0700)]
mm: hugetlb: replace hugetlb_free_vmemmap_enabled with a static_key
The page_fixed_fake_head() is used throughout memory management and the
conditional check requires checking a global variable. Although the
overhead of this check may be small, it increases when the memory cache
comes under pressure. Also, the global variable will not be modified
after system boot, so it is very appropriate to use the static key mechanism.
Link: https://lkml.kernel.org/r/20211101031651.75851-3-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:45:00 +0000 (14:45 -0700)]
mm: hugetlb: free the 2nd vmemmap page associated with each HugeTLB page
Patch series "Free the 2nd vmemmap page associated with each HugeTLB
page", v7.
This series can minimize the overhead of struct page for 2MB HugeTLB
pages significantly. It further reduces the overhead of struct page by
12.5% for a 2MB HugeTLB compared to the previous approach, which means
2GB per 1TB HugeTLB. It is a nice gain. Comments and reviews are
welcome. Thanks.
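The 12.5% and 2GB-per-1TB figures can be reproduced with a little arithmetic. The sketch below assumes a 4 KiB base page and a 64-byte struct page, neither of which is stated in the text; the MODEL_* constants and function names are hypothetical.

```c
#include <stdint.h>

#define MODEL_BASE_PAGE_SIZE   4096ULL      /* assumed 4 KiB base page */
#define MODEL_STRUCT_PAGE_SIZE 64ULL        /* assumed sizeof(struct page) */
#define MODEL_HUGE_PAGE_SIZE   (2ULL << 20) /* 2MB HugeTLB page */

/* Number of vmemmap pages holding the struct pages of one 2MB huge page. */
static uint64_t model_vmemmap_pages_per_huge_page(void)
{
	uint64_t nr_struct_pages = MODEL_HUGE_PAGE_SIZE / MODEL_BASE_PAGE_SIZE;

	return nr_struct_pages * MODEL_STRUCT_PAGE_SIZE / MODEL_BASE_PAGE_SIZE;
}

/*
 * Bytes saved by freeing one additional vmemmap page per huge page,
 * for hugetlb_bytes of 2MB HugeTLB memory.
 */
static uint64_t model_extra_saving_bytes(uint64_t hugetlb_bytes)
{
	return hugetlb_bytes / MODEL_HUGE_PAGE_SIZE * MODEL_BASE_PAGE_SIZE;
}
```

Under these assumptions each 2MB huge page has 8 vmemmap pages, so freeing one more of them is 1/8 = 12.5% of the struct page overhead, and 1TB of HugeTLB memory (524288 huge pages) yields 524288 * 4 KiB = 2GB.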
For the main implementation and details, refer to the commit log of patch
1. In this series, I have changed the following four helpers; the
following table shows the impact on the overhead of those helpers.
+------------------+-----------------------+
|       APIs       | head page | tail page |
+------------------+-----------+-----------+
|    PageHead()    |     Y     |     N     |
+------------------+-----------+-----------+
|    PageTail()    |     Y     |     N     |
+------------------+-----------+-----------+
|  PageCompound()  |     N     |     N     |
+------------------+-----------+-----------+
| compound_head()  |     Y     |     N     |
+------------------+-----------+-----------+
Y: Overhead is increased.
N: Overhead is _NOT_ increased.
It shows that the overhead of those helpers on a tail page doesn't change
between "hugetlb_free_vmemmap=on" and "hugetlb_free_vmemmap=off". But the
overhead on a head page will be increased when "hugetlb_free_vmemmap=on"
(except PageCompound()). So I believe that Matthew Wilcox's folio series
will help with this.
There are far fewer users of PageHead() and PageTail() than of
compound_head(), and most users of PageTail() are VM_BUG_ON(), so I have
done some tests about the overhead of compound_head() on head pages.
I have tested the overhead of calling compound_head() on a head page,
which is 2.11ns (measured by timing 10 million calls to compound_head()
and averaging).
For a head page whose address is not aligned with PAGE_SIZE or a
non-compound page, the overhead of compound_head() is 2.54ns which is
increased by 20%. For a head page whose address is aligned with
PAGE_SIZE, the overhead of compound_head() is 2.97ns which is increased by
40%. Most pages are the former. I do not think the overhead is
significant since the overhead of compound_head() itself is low.
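The quoted percentages follow from the measured times if 2.11ns is taken as the baseline, which is this example's reading of the text. A minimal sketch, with a hypothetical helper name:

```c
/* Relative increase of measured_ns over baseline_ns, in percent. */
static double model_pct_increase(double baseline_ns, double measured_ns)
{
	return (measured_ns - baseline_ns) / baseline_ns * 100.0;
}
```

With that baseline, 2.54ns comes out at a little over 20% and 2.97ns at a little over 40%, consistent with the rounded figures above.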
This patch (of 5):
This patch minimizes the overhead of struct page for 2MB HugeTLB pages
significantly. It further reduces the overhead of struct page by 12.5%
for a 2MB HugeTLB compared to the previous approach, which means 2GB per
1TB HugeTLB (2MB type).
After the feature of "Free some vmemmap pages of HugeTLB page" is
enabled, the mapping of the vmemmap addresses associated with a 2MB
HugeTLB page becomes the figure below.
As we can see, the 2nd vmemmap page frame (indexed by 1) is reused and
remapped. However, the 2nd vmemmap page frame can also be freed to
the buddy allocator, then we can change the mapping from the figure
above to the figure below.
After we do this, all tail vmemmap pages (1-7) are mapped to the head
vmemmap page frame (0). In other words, there is more than one page
struct with PG_head associated with each HugeTLB page. We __know__ that
there is only one head page struct; the tail page structs with PG_head are
fake head page structs. We need an approach to distinguish between those
two different types of page structs so that compound_head(), PageHead()
and PageTail() can work properly if the parameter is a tail page struct
but with PG_head.
The following code snippet describes how to distinguish between real and
fake head page struct.
    if (test_bit(PG_head, &page->flags)) {
        unsigned long head = READ_ONCE(page[1].compound_head);

        if (head & 1) {
            if (head == (unsigned long)page + 1)
                ==> head page struct
            else
                ==> tail page struct
        } else
            ==> head page struct
    }
We can safely access the field of the @page[1] with PG_head because the
@page is a compound page composed of at least two contiguous pages.
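The check above can be exercised in user space with a small model. This is a sketch under stated assumptions, not the kernel implementation: struct model_page, MODEL_PG_HEAD, model_build() and model_real_head() are hypothetical names, the remapped vmemmap page is simulated by copying the real head and first tail structs to a second location, and READ_ONCE() is omitted because the model is single-threaded.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_PG_HEAD 1UL

/* Minimal stand-in for struct page: only the two fields the check needs. */
struct model_page {
	unsigned long flags;
	uintptr_t compound_head;	/* bit 0 set => tail, rest => head address */
};

/*
 * real[0] is the head, real[1] its first tail. alias[] holds the same
 * bytes at a different address, mimicking a tail vmemmap page that has
 * been remapped onto the head vmemmap page frame.
 */
static void model_build(struct model_page real[2], struct model_page alias[2])
{
	real[0].flags = MODEL_PG_HEAD;
	real[0].compound_head = 0;
	real[1].flags = 0;
	real[1].compound_head = (uintptr_t)&real[0] + 1;
	memcpy(alias, real, 2 * sizeof(real[0]));
}

/* Return the real head if @page is a fake head, otherwise @page itself. */
static struct model_page *model_real_head(struct model_page *page)
{
	if (page->flags & MODEL_PG_HEAD) {
		uintptr_t head = page[1].compound_head;

		/* page[1] points back at a different struct: @page is fake. */
		if ((head & 1) && head != (uintptr_t)page + 1)
			return (struct model_page *)(head - 1);
	}
	return page;
}
```

After model_build(real, alias), model_real_head(&real[0]) returns &real[0], while model_real_head(&alias[0]) also resolves to &real[0], because the aliased first tail still records the address of the real head.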
[songmuchun@bytedance.com: restore lost comment changes]
Link: https://lkml.kernel.org/r/20211101031651.75851-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20211101031651.75851-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Barry Song <song.bao.hua@hisilicon.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Chen Huang <chenhuang5@huawei.com> Cc: Bodeddula Balasubramaniam <bodeddub@amazon.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
user_shm_lock forgets to set allowed to 0 when get_ucounts fails. So
the later user_shm_unlock might do the extra dec_rlimit_ucounts. Fix
this by resetting allowed to 0.
Link: https://lkml.kernel.org/r/20220310132417.41189-1-linmiaohe@huawei.com Fixes: 3b7f54e42fe7 ("Reimplement RLIMIT_MEMLOCK on top of ucounts") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Hugh Dickins <hughd@google.com> Cc: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Cc: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vlastimil Babka [Tue, 22 Mar 2022 21:44:53 +0000 (14:44 -0700)]
mm, fault-injection: declare should_fail_alloc_page()
The mm/ directory can almost fully be built with W=1, which would help
in local development. One remaining issue is a missing prototype for
should_fail_alloc_page(). Thus add it next to the should_failslab()
prototype.
Note the previous attempt by commit 6bba2f5d063d ("mm/page_alloc: make
should_fail_alloc_page() static") had to be reverted by commit
dfe7ed49bdbd as it caused an unresolved symbol error with
CONFIG_DEBUG_INFO_BTF=y.
Link: https://lkml.kernel.org/r/20220314165724.16071-1-vbabka@suse.cz Signed-off-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:50 +0000 (14:44 -0700)]
mm/memory-failure.c: make non-LRU movable pages unhandlable
We can not really handle non-LRU movable pages in memory failure.
Typically they are balloon, zsmalloc, etc.
Assuming we run into a base (4K) non-LRU movable page, we could reach as
far as identify_page_state(), where it should not fall into any category
except me_unknown.
For the non-LRU compound movable pages, they could be taken for
transhuge pages but it's unexpected to split non-LRU movable pages using
split_huge_page_to_list in memory_failure. So we could just simply make
non-LRU movable pages unhandlable to avoid these possible nasty cases.
Link: https://lkml.kernel.org/r/20220312074613.4798-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Suggested-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:47 +0000 (14:44 -0700)]
mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages
Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
into its one caller"), invalidate_inode_page() can invalidate the pages
in the swap cache because the check of page->mapping != mapping is
removed. But invalidate_inode_page() is not expected to deal with
pages in the swap cache. Non-LRU movable pages can reach here too.
They're not page cache pages. Skip these pages by checking
PageSwapCache and PageLRU.
Link: https://lkml.kernel.org/r/20220312074613.4798-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:44 +0000 (14:44 -0700)]
mm/memory-failure.c: fix race with changing page compound again
Patch series "A few fixup patches for memory failure", v2.
This series contains a few patches to fix the race with a changing
compound page, make non-LRU movable pages unhandlable and so on. More
details can be found in the respective changelogs.
There is a race window: after we get the compound_head, the hugetlb page
could be freed to buddy, or even changed to another compound page, just
before we try to get the hwpoison page. Consider the race window below:
  CPU 1                                   CPU 2

  memory_failure_hugetlb
  struct page *head = compound_head(p);
                                          hugetlb page might be freed to
                                          buddy, or even changed to another
                                          compound page.

  get_hwpoison_page -- page is not what we want now...
If this race happens, just bail out. Also MF_MSG_DIFFERENT_PAGE_SIZE is
introduced to record this event.
[akpm@linux-foundation.org: s@/**@/*@, per Naoya Horiguchi]
After successfully obtaining the reference count of the huge page, it is
still necessary to call hwpoison_filter() to make a filter judgement,
otherwise the filtered hugepage will be unmapped and the related process
may be killed.
Link: https://lkml.kernel.org/r/20220223082254.2769757-1-luofei@unicloud.com Signed-off-by: luofei <luofei@unicloud.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
luofei [Tue, 22 Mar 2022 21:44:38 +0000 (14:44 -0700)]
mm/hwpoison: avoid the impact of hwpoison_filter() return value on mce handler
When the hwpoison page meets the filter conditions, it should not be
regarded as successful memory_failure() processing by the mce handler;
memory_failure() should return a distinct value. Otherwise the mce handler
assumes the error page has been identified and isolated, which may lead to
calling set_mce_nospec() to change page attributes, etc.
Here memory_failure() returns -EOPNOTSUPP to indicate that the error
event is filtered; the mce handler should not take any action in this
situation and the hwpoison injector should treat it as correct.
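The caller-side distinction described above can be sketched as a small classifier over the memory_failure() return value. This is a user-space model, not the mce handler's code: the enum and model_mce_action_for() are hypothetical, and all other error codes are lumped into a single "not recovered" bucket for brevity even though the real handler may treat some of them differently.

```c
#include <errno.h>

enum model_mce_action {
	MODEL_MCE_ISOLATE_PAGE,		/* success: page handled, may change attributes */
	MODEL_MCE_NO_ACTION,		/* event was filtered: leave the page alone */
	MODEL_MCE_NOT_RECOVERED,	/* any other error (simplified) */
};

/* Map a memory_failure()-style return value to what the caller should do. */
static enum model_mce_action model_mce_action_for(int memory_failure_ret)
{
	if (memory_failure_ret == 0)
		return MODEL_MCE_ISOLATE_PAGE;
	if (memory_failure_ret == -EOPNOTSUPP)
		return MODEL_MCE_NO_ACTION;
	return MODEL_MCE_NOT_RECOVERED;
}
```

The point of the distinct value is visible in the first two branches: before this change a filtered event was indistinguishable from success, so the page-isolation path would have run for it.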
Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com Signed-off-by: luofei <luofei@unicloud.com> Acked-by: Borislav Petkov <bp@suse.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:35 +0000 (14:44 -0700)]
mm/hwpoison-inject: support injecting hwpoison to free page
memory_failure() can handle free buddy pages. Support injecting hwpoison
into a free page by adding an is_free_buddy_page check when the hwpoison
filter is disabled.
[akpm@linux-foundation.org: export is_free_buddy_page() to modules]
Link: https://lkml.kernel.org/r/20220218092052.3853-1-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:24 +0000 (14:44 -0700)]
mm/memory-failure.c: remove PageSlab check in hwpoison_filter_dev
Since commit 836af871a108 ("mm: fix crash when using XFS on loopback"),
page_mapping() can handle the Slab pages. So remove this unnecessary
PageSlab check and obsolete comment.
Link: https://lkml.kernel.org/r/20220218090118.1105-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:21 +0000 (14:44 -0700)]
mm/memory-failure.c: fix race with changing page more robustly
We only intend to deal with non-compound pages after we split the
thp in memory_failure. However, the page could have become part of a
compound page again due to a race window. If this happens, we could retry
once to hopefully handle the page in the next round. Also remove the
unneeded orig_head. It's always equal to hpage, so we can use hpage
directly and remove this redundant one.
Link: https://lkml.kernel.org/r/20220218090118.1105-5-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:15 +0000 (14:44 -0700)]
mm/memory-failure.c: catch unexpected -EFAULT from vma_address()
It's unexpected to walk the page table when vma_address() returns
-EFAULT. But dev_pagemap_mapping_shift() is called only when the vma
associated with the error page has already been found in
collect_procs_{file,anon}, so vma_address() should not return -EFAULT
except with some bug, as Naoya pointed out. We can use VM_BUG_ON_VMA()
to catch this bug here.
Link: https://lkml.kernel.org/r/20220218090118.1105-3-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:44:12 +0000 (14:44 -0700)]
mm/memory-failure.c: minor clean up for memory_failure_dev_pagemap
Patch series "A few cleanup and fixup patches for memory failure", v3.
This series contains a few patches to simplify the code logic, remove
unneeded variable and remove obsolete comment. Also we fix race
changing page more robustly in memory_failure. More details can be
found in the respective changelogs.
This patch (of 8):
The flags always have MF_ACTION_REQUIRED and MF_MUST_KILL set. So we do
not need to check these flags again.
Rik van Riel [Tue, 22 Mar 2022 21:44:09 +0000 (14:44 -0700)]
mm: invalidate hwpoison page cache page in fault path
Sometimes the page offlining code can leave behind a hwpoisoned clean
page cache page. This can lead to programs being killed over and over
and over again as they fault in the hwpoisoned page, get killed, and
then get re-spawned by whatever wanted to run them.
This is particularly embarrassing when the page was offlined due to
having too many corrected memory errors. Now we are killing tasks due
to them trying to access memory that probably isn't even corrupted.
This problem can be avoided by invalidating the page from the page fault
handler, which already has a branch for dealing with these kinds of
pages. With this patch we simply pretend the page fault was successful
if the page was invalidated, return to userspace, incur another page
fault, read in the file from disk (to a new memory page), and then
everything works again.
Link: https://lkml.kernel.org/r/20220212213740.423efcea@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Tue, 22 Mar 2022 21:44:06 +0000 (14:44 -0700)]
mm/hwpoison: fix error page recovered but reported "not recovered"
When an uncorrected memory error is consumed there is a race between the
CMCI from the memory controller reporting an uncorrected error with a
UCNA signature, and the core reporting an SRAR signature machine check
when the data is about to be consumed.
If the CMCI wins that race, the page is marked poisoned when
uc_decode_notifier() calls memory_failure() and the machine check
processing code finds the page already poisoned. It calls
kill_accessing_process() to make sure a SIGBUS is sent, but returns the
wrong error code.
Console log looks like this:
mce: Uncorrected hardware memory error in user-access at 3710b3400
Memory failure: 0x3710b3: recovery action for dirty LRU page: Recovered
Memory failure: 0x3710b3: already hardware poisoned
Memory failure: 0x3710b3: Sending SIGBUS to einj_mem_uc:361438 due to hardware memory corruption
mce: Memory error not recovered
kill_accessing_process() is supposed to return -EHWPOISON to notify that
SIGBUS has already been sent to the process and kill_me_maybe() doesn't
have to send it again. But the current code simply fails to do this, so
fix it to make sure it works as intended. This change avoids the noise
message "Memory error not recovered" and skips duplicate SIGBUSs.
[tony.luck@intel.com: reword some parts of commit message]
Link: https://lkml.kernel.org/r/20220113231117.1021405-1-naoya.horiguchi@linux.dev Fixes: ba7972378c7f ("mm,hwpoison: send SIGBUS with error virutal address") Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Youquan Song <youquan.song@intel.com> Cc: Tony Luck <tony.luck@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Naoya Horiguchi [Tue, 22 Mar 2022 21:44:03 +0000 (14:44 -0700)]
mm/memory-failure.c: remove obsolete comment
With the introduction of mf_mutex, most of the memory error handling
process is mutually exclusive, so the in-line comment about the subtlety
of double-checking PageHWPoison is no longer correct. So remove it.
Link: https://lkml.kernel.org/r/20220125025601.3054511-1-naoya.horiguchi@linux.dev Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2022 21:44:00 +0000 (14:44 -0700)]
mm/page_alloc: check high-order pages for corruption during PCP operations
Eric Dumazet pointed out that commit f769c269399e ("mm/page_alloc: allow
high-order pages to be stored on the per-cpu lists") only checks the
head page during PCP refill and allocation operations. This was an
oversight and all pages should be checked. This will incur a small
performance penalty but it's necessary for correctness.
Link: https://lkml.kernel.org/r/20220310092456.GJ15701@techsingularity.net Fixes: f769c269399e ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reported-by: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Wei Xu <weixugc@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: count time in drain_all_pages during direct reclaim as memory pressure
When page allocation in direct reclaim path fails, the system will make
one attempt to shrink per-cpu page lists and free pages from high alloc
reserves. Draining per-cpu pages into buddy allocator can be a very
slow operation because it's done using workqueues and the task in direct
reclaim waits for all of them to finish before proceeding. Currently
this time is not accounted as psi memory stall.
While testing mobile devices under extreme memory pressure, when
allocations are failing during direct reclaim, we noticed that psi
events which would be expected in such conditions were not triggered.
After profiling these cases it was determined that the reason for
missing psi events was that a big chunk of time spent in direct reclaim
is not accounted as memory stall, therefore psi would not reach the
levels at which an event is generated. Further investigation revealed
that the bulk of that unaccounted time was spent inside drain_all_pages
call.
A typical captured case when drain_all_pages path gets activated:
__alloc_pages_slowpath  took 44.644.613ns
    __perform_reclaim   took    751.668ns (1.7%)
    drain_all_pages     took 43.887.167ns (98.3%)
PSI in this case records the time spent in __perform_reclaim but ignores
drain_all_pages, IOW it misses 98.3% of the time spent in
__alloc_pages_slowpath.
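The shares quoted for the captured case can be recomputed from the three nanosecond values. The helper below is a hypothetical sketch that works in integer per-mille to avoid floating point.

```c
#include <stdint.h>

/* Share of part_ns in total_ns, in per-mille, truncated toward zero. */
static uint64_t model_share_permille(uint64_t part_ns, uint64_t total_ns)
{
	return part_ns * 1000 / total_ns;
}
```

For the trace above, 43887167 of 44644613 ns gives 983 per-mille (98.3%) for drain_all_pages, and 751668 ns gives 16 per-mille when truncated, which the text rounds to 1.7% for __perform_reclaim.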
Annotate __alloc_pages_direct_reclaim in its entirety so that delays
from handling page allocation failure in the direct reclaim path are
accounted as memory stall.
Link: https://lkml.kernel.org/r/20220223194812.1299646-1-surenb@google.com Signed-off-by: Suren Baghdasaryan <surenb@google.com> Reported-by: Tim Murray <timmurray@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Petr Mladek <pmladek@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
numa_register_memblks() is only interested in those nodes which have
memory, so it skips over any memoryless node it finds. Later on, when
we have read ACPI's SRAT table, we call init_cpu_to_node() and
init_gi_nodes(), which initialize any memoryless node we might have that
has either CPU or Initiator affinity, meaning we allocate a pg_data_t
struct for them and we mark them as ONLINE.
So far so good, but the thing is that after ("mm: handle uninitialized
numa nodes gracefully"), we allocate all possible NUMA nodes in
free_area_init(), meaning we have a picture like the following:
free_area_init() already allocates all possible NUMA nodes, but
init_cpu_to_node() and init_gi_nodes() are clueless about that, so they
go ahead and allocate a new pg_data_t struct without checking anything,
meaning we end up allocating twice.
It should be made clear that this only happens in the case where a
memoryless NUMA node happens to have a CPU/Initiator affinity.
So get rid of init_memory_less_node() and just set the node online.
Note that setting the node online is needed, otherwise we choke down the
chain when bringup_nonboot_cpus() ends up calling
__try_online_node()->register_one_node()->... and we blow up in
bus_add_device(). As can be seen here:
The reason is simple, by the time bringup_nonboot_cpus() gets called, we
did not register the node_subsys bus yet, so we crash when
bus_add_device() tries to dereference bus()->p.
The following shows the order of the calls:
 kernel_init_freeable
  smp_init
   bringup_nonboot_cpus
    ...
     bus_add_device()      <- we did not register node_subsys yet
  do_basic_setup
   do_initcalls
    postcore_initcall(register_node_type);
     register_node_type
      subsys_system_register
       subsys_register
        bus_register         <- register node_subsys bus
Why does setting the node online save us then? Well, simply because
__try_online_node() backs off when the node is online, meaning we do not
end up calling register_one_node() in the first place.
This is subtle, broken and deserves a deep analysis and thought about
how to put this into shape, but for now let us have this easy fix for
the leaking memory issue.
[osalvador@suse.de: add comments] Link: https://lkml.kernel.org/r/20220221142649.3457-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20220218224302.5282-2-osalvador@suse.de Fixes: da4490c958ad ("mm: handle uninitialized numa nodes gracefully") Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Rafael Aquini <raquini@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Alexey Makhalov <amakhalov@vmware.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2022 21:43:48 +0000 (14:43 -0700)]
mm/page_alloc: do not prefetch buddies during bulk free
free_pcppages_bulk() has taken two passes through the pcp lists since
commit 9a9b1b64a899 ("mm/free_pcppages_bulk: do not hold lock when
picking pages to free") due to deferring the cost of selecting PCP lists
until the zone lock is held.
As the list processing now takes place under the zone lock, it's less
clear that this will always benefit for two reasons.
1. There is a guaranteed cost to calculating the buddy which definitely
has to be calculated again. However, as the zone lock is held and
there is no deferring of buddy merging, there is no guarantee that the
prefetch will have completed when the second buddy calculation takes
place and buddies are being merged. With or without the prefetch, there
may be further stalls depending on how many pages get merged. In other
words, a stall due to merging is inevitable and at best only one stall
might be avoided at the cost of calculating the buddy location twice.
2. As the zone lock is held, prefetch_nr makes less sense as once
prefetch_nr expires, the cache lines of interest have already been
merged.
The main concern is that there is a definite cost to calculating the
buddy location early for the prefetch and it is a "maybe win" depending
on whether the CPU prefetch logic and memory is fast enough. Remove the
prefetch logic on the basis that reduced instructions in a path is
always a saving whereas the prefetch might save one memory stall
depending on the CPU and memory.
In most cases, this has marginal benefit as the calculations are a small
part of the overall freeing of pages. However, it was detectable on at
least one machine.
Mel Gorman [Tue, 22 Mar 2022 21:43:45 +0000 (14:43 -0700)]
mm/page_alloc: limit number of high-order pages on PCP during bulk free
When a PCP is mostly used for frees then high-order pages can exist on
PCP lists for some time. This is problematic when the allocation
pattern is all allocations from one CPU and all frees from another
resulting in colder pages being used. When bulk freeing pages, limit
the number of high-order pages that are stored on the PCP lists.
Netperf running on localhost exhibits this pattern and while it does not
matter for some machines, it does matter for others with smaller caches
where cache misses cause problems due to reduced page reuse. Pages
freed directly to the buddy list may be reused quickly while still cache
hot whereas pages stored on the PCP lists may be cold by the time
free_pcppages_bulk() is called.
Using perf kmem:mm_page_alloc, the 5 most used page frames were
The spread is wider as there is still time before pages freed to one PCP
get released with a tradeoff between fast reuse and reduced zone lock
acquisition.
On the machine used to gather the traces, the headline performance was
equivalent.
Note that this was a machine that did not benefit from caching high-order
pages and performance is almost restored with the series applied. It's
not fully restored as cache misses are still higher. This is a trade-off
between optimising for a workload that does all allocs on one CPU and
frees on another or more general workloads that need high-order pages for
SLUB and benefit from avoiding zone->lock for every SLUB refill/drain.
Link: https://lkml.kernel.org/r/20220217002227.5739-7-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2022 21:43:42 +0000 (14:43 -0700)]
mm/page_alloc: free pages in a single pass during bulk free
free_pcppages_bulk() has taken two passes through the pcp lists since
commit 9a9b1b64a899 ("mm/free_pcppages_bulk: do not hold lock when
picking pages to free") due to deferring the cost of selecting PCP lists
until the zone lock is held. Now that list selection is simpler, the
main cost during selection is bulkfree_pcp_prepare() which in the normal
case is a simple check and prefetching. As the list manipulations have
cost in itself, go back to freeing pages in a single pass.
The series up to this point was evaluated using a trunc microbenchmark
that is truncating sparse files stored in page cache (mmtests config
config-io-trunc). Sparse files were used to limit filesystem
interaction. The results versus a revert of storing high-order pages in
the PCP lists is
Mel Gorman [Tue, 22 Mar 2022 21:43:38 +0000 (14:43 -0700)]
mm/page_alloc: drain the requested list first during bulk free
Prior to the series, pindex 0 (order-0 MIGRATE_UNMOVABLE) was always
skipped first and the precise reason is forgotten. A potential reason
may have been to artificially preserve MIGRATE_UNMOVABLE but there is no
reason why that would be optimal as it depends on the workload. The
more likely reason is that it was less complicated to do a pre-increment
instead of a post-increment in terms of overall code flow. As
free_pcppages_bulk() now typically receives the pindex of the PCP list
that exceeded high, always start draining that list.
Link: https://lkml.kernel.org/r/20220217002227.5739-5-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2022 21:43:36 +0000 (14:43 -0700)]
mm/page_alloc: simplify how many pages are selected per pcp list during bulk free
free_pcppages_bulk() selects pages to free by round-robining between
lists. Originally this was to evenly shrink pages by migratetype but
uneven freeing is inevitable due to high pages. Simplify list selection
by starting with a list that definitely has pages on it in
free_unref_page_commit() and for drain, it does not matter where
draining starts as all pages are removed.
Link: https://lkml.kernel.org/r/20220217002227.5739-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2022 21:43:33 +0000 (14:43 -0700)]
mm/page_alloc: track range of active PCP lists during bulk free
free_pcppages_bulk() frees pages in a round-robin fashion. Originally,
this was dealing only with migratetypes but storing high-order pages
means that there can be many more empty lists that are uselessly
checked. Track the minimum and maximum active pindex to reduce the
search space.
Link: https://lkml.kernel.org/r/20220217002227.5739-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mel Gorman [Tue, 22 Mar 2022 21:43:30 +0000 (14:43 -0700)]
mm/page_alloc: fetch the correct pcp buddy during bulk free
Patch series "Follow-up on high-order PCP caching", v2.
Commit f769c269399e ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists") was primarily aimed at reducing the cost of SLUB
cache refills of high-order pages in two ways. Firstly, zone lock
acquisitions were reduced and secondly, there were fewer buddy list
modifications. This is a follow-up series fixing some issues that
became apparent after merging.
Patch 1 is a functional fix. The bug it fixes is harmless but inefficient.
Patches 2-5 reduce the overhead of bulk freeing of PCP pages. While the
overhead is small, it's cumulative and noticeable when truncating large
files. The changelog for patch 4 includes results of a microbench that
deletes large sparse files with data in page cache. Sparse files were
used to eliminate filesystem overhead.
Patch 6 addresses issues with high-order PCP pages being stored on PCP
lists for too long. Pages freed on a CPU potentially may not be quickly
reused and in some cases this can increase cache miss rates. Details
are included in the changelog.
This patch (of 6):
free_pcppages_bulk() prefetches buddies about to be freed but the order
must also be passed in as PCP lists store multiple orders.
Link: https://lkml.kernel.org/r/20220217002227.5739-1-mgorman@techsingularity.net Link: https://lkml.kernel.org/r/20220217002227.5739-2-mgorman@techsingularity.net Fixes: f769c269399e ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Aaron Lu <aaron.lu@intel.com> Tested-by: Aaron Lu <aaron.lu@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Alistair Popple [Tue, 22 Mar 2022 21:43:26 +0000 (14:43 -0700)]
mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node
ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
not enough room for ZONE_MOVABLE on that node.
Unfortunately this condition is not checked for. This leads to
zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
node.
calculate_node_totalpages() then sets zone->present_pages to be greater
than zone->spanned_pages which is invalid, as spanned_pages represents
the maximum number of pages in a zone assuming no holes.
Subsequently it is possible free_area_init_core() will observe a zone of
size zero with present pages. In this case it will skip setting up the
zone, including the initialisation of free_lists[].
However populated_zone() checks zone->present_pages to see if a zone has
memory available. This is used by iterators such as
walk_zones_in_node(). pagetypeinfo_showfree() uses this to walk the
free_list of each zone in each node, which are assumed to be initialised
due to the zone not being empty.
As free_area_init_core() never initialised the free_lists[] this results
in the following kernel crash when trying to read /proc/pagetypeinfo:
Fix this by checking that the aligned zone_movable_pfn[] does not exceed
the end of the node, and if it does skip creating a movable zone on this
node.
Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com Fixes: b773eb8babd5 ("Create the ZONE_MOVABLE zone") Signed-off-by: Alistair Popple <apopple@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit c1ef52ef7c42 ("locking/local_lock: Make the empty local_lock_*()
function a macro.") in the -tip tree converted the local_lock_*()
functions into macros, which causes a warning with clang with
CONFIG_PREEMPT_RT=n + CONFIG_DEBUG_LOCK_ALLOC=n:
mm/page_alloc.c:131:40: error: variable 'pagesets' is not needed and will not be emitted [-Werror,-Wunneeded-internal-declaration]
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
^
1 error generated.
Prior to that change, clang was not able to tell that pagesets was
unused in this configuration because it does not perform cross function
analysis in the frontend. After that change, it sees that the macros
just do a typecheck on the lock member of pagesets, which is evaluated
at compile time (so the variable is technically "used"), meaning the
variable is not needed in the final assembly, as the warning states.
Mark the variable as __maybe_unused to make it clear to clang that this
is expected in this configuration so there is no more warning.
Link: https://github.com/ClangBuiltLinux/linux/issues/1593 Link: https://lkml.kernel.org/r/20220215184322.440969-1-nathan@kernel.org Signed-off-by: Nathan Chancellor <nathan@kernel.org> Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Reported-by: "kernelci.org bot" <bot@kernelci.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Some places in the kernel don't really expect pageblock_order >=
MAX_ORDER, and it looks like this is only possible in corner cases:
1) CONFIG_DEFERRED_STRUCT_PAGE_INIT we'll end up freeing pageblock_order
pages via __free_pages_core(), which cannot possibly work.
2) find_zone_movable_pfns_for_nodes() will roundup the ZONE_MOVABLE
start PFN to MAX_ORDER_NR_PAGES. Consequently with a bigger
pageblock_order, we could have a single pageblock partially managed by
two zones.
3) compaction code runs into __fragmentation_index() with order
>= MAX_ORDER, when checking WARN_ON_ONCE(order >= MAX_ORDER). [1]
4) mm/page_reporting.c won't be reporting any pages with default
page_reporting_order == pageblock_order, as we'll be skipping the
reporting loop inside page_reporting_process_zone().
5) __rmqueue_fallback() will never be able to steal with
ALLOC_NOFRAGMENT.
pageblock_order >= MAX_ORDER is weird either way: it's a pure
optimization for making alloc_contig_range(), as used for allocation of
gigantic pages, a little more likely to succeed. However, if there is
demand for somewhat reliable allocation of gigantic pages, affected
setups should be using CMA or boottime allocations instead.
So let's make sure that pageblock_order < MAX_ORDER and simplify.
Link: https://lkml.kernel.org/r/20220214174132.219303-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Frank Rowand <frowand.list@gmail.com> Cc: John Garry via iommu <iommu@lists.linux-foundation.org> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Rob Herring <robh+dt@kernel.org> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Patch series "mm: enforce pageblock_order < MAX_ORDER".
Having pageblock_order >= MAX_ORDER seems to be able to happen in corner
cases and some parts of the kernel are not prepared for it.
For example, Aneesh has shown [1] that such kernels can be compiled on
ppc64 with 64k base pages by setting FORCE_MAX_ZONEORDER=8, which will
run into a WARN_ON_ONCE(order >= MAX_ORDER) in compaction code right
during boot.
We can get pageblock_order >= MAX_ORDER when the default hugetlb size is
bigger than the maximum allocation granularity of the buddy, in which
case we are no longer talking about huge pages but instead gigantic
pages.
Having pageblock_order >= MAX_ORDER can only make alloc_contig_range()
of such gigantic pages more likely to succeed.
Reliable use of gigantic pages either requires boot time allocation or
CMA, no need to overcomplicate some places in the kernel to optimize for
corner cases that are broken in other areas of the kernel.
This patch (of 2):
Let's enforce pageblock_order < MAX_ORDER and simplify.
Especially patch #1 can be regarded a cleanup before:
[PATCH v5 0/6] Use pageblock_order for cma and alloc_contig_range
alignment. [2]
Link: https://lkml.kernel.org/r/20220214174132.219303-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Acked-by: Rob Herring <robh@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Frank Rowand <frowand.list@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Robin Murphy <robin.murphy@arm.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: John Garry via iommu <iommu@lists.linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Zi Yan [Tue, 22 Mar 2022 21:43:05 +0000 (14:43 -0700)]
mm: page_alloc: avoid merging non-fallbackable pageblocks with others
This is done in addition to MIGRATE_ISOLATE pageblock merge avoidance.
It prepares for the upcoming removal of the MAX_ORDER-1 alignment
requirement for CMA and alloc_contig_range().
MIGRATE_HIGHATOMIC should not merge with other migratetypes like
MIGRATE_ISOLATE and MIGRATE_CMA[1], so this commit prevents that too.
Remove MIGRATE_CMA and MIGRATE_ISOLATE from fallbacks list, since they
are never used.
Link: https://lkml.kernel.org/r/20220124175957.2f7dee2-1-zi.yan@sent.com Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Mike Rapoport <rppt@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
That extra variable has been introduced just for keeping an original
passed gfp_mask because it is updated with __GFP_NOWARN on entry, thus
error handling messages were broken.
Instead we can keep an original gfp_mask without modifying it and add an
extra __GFP_NOWARN flag together with gfp_mask as a parameter to the
vm_area_alloc_pages() function. It will make it less confusing.
Link: https://lkml.kernel.org/r/20220119143540.601149-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Uladzislau Rezki [Tue, 22 Mar 2022 21:42:53 +0000 (14:42 -0700)]
mm/vmalloc: add adjust_search_size parameter
Extend the find_vmap_lowest_match() function with one more parameter.
It is the "adjust_search_size" boolean, which makes it possible to
control the accuracy of the block search when a specific alignment is required.
With this patch, the search size is always adjusted so that a request is
served as fast as possible, for performance reasons.
There is one exception, though: short ranges where the requested
size corresponds to the passed vstart/vend restriction together with a
specific alignment request. In such a scenario an adjustment would not
lead to a successful allocation.
Link: https://lkml.kernel.org/r/20220119143540.601149-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki <uladzislau.rezki@sony.com> Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/vmalloc: Move draining areas out of caller context
A caller initiates the drain process from its context once the
drain threshold is reached or passed. There are at least two
drawbacks of doing so:
a) a caller can be a high-prio or RT task. In that case it can
get stuck doing the actual drain of all lazily freed areas.
This is not optimal because such tasks usually are latency
sensitive where the control should be returned back as soon
as possible in order to drive such workloads in time. See ae10d79696ba ("mm/vmalloc: rework the drain logic")
b) It is not safe to call vfree() while holding a spinlock due
to the vmap_purge_lock mutex. There was a report about this from
Zeal Robot <zealci@zte.com.cn> here:
https://lore.kernel.org/all/20211222081026.484058-1-chi.minghao@zte.com.cn
Moving the drain to the separate work context addresses those
issues.
v1->v2:
- Added prefix "_work" to the drain worker function.
v2->v3:
- Remove the drain_vmap_work_in_progress. Extra queuing
is expected under heavy load but it can be disregarded
because the work will bail out if there is nothing to be done.
Link: https://lkml.kernel.org/r/20220131144058.35608-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Uladzislau Rezki <uladzislau.rezki@sony.com> Cc: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Tue, 22 Mar 2022 21:42:39 +0000 (14:42 -0700)]
mm/mmap: remove obsolete comment in ksys_mmap_pgoff
RLIMIT_MEMLOCK is already reimplemented on top of ucounts now. And
since commit c046d24b77e3 ("mm,hugetlb: remove mlock ulimit for
SHM_HUGETLB"), mlock ulimit for SHM_HUGETLB is further removed.
_install_special_mapping() adds the VM_SPECIAL bit VM_DONTEXPAND (and
never attempts to update locked_vm), so it ought to be consistent with
mmap_region() and mlock_fixup(), making sure not to add VM_LOCKED or
VM_LOCKONFAULT. I doubt that this fixes any problem in practice: just
do it for consistency.
Randy Dunlap [Tue, 22 Mar 2022 21:42:27 +0000 (14:42 -0700)]
mm/mmap: return 1 from stack_guard_gap __setup() handler
__setup() handlers should return 1 if the command line option is handled
and 0 if not (or maybe never return 0; it just pollutes init's
environment). This prevents:
Unknown kernel command line parameters \
"BOOT_IMAGE=/boot/bzImage-517rc5 stack_guard_gap=100", will be \
passed to user space.
Run /sbin/init as init process
with arguments:
/sbin/init
with environment:
HOME=/
TERM=linux
BOOT_IMAGE=/boot/bzImage-517rc5
stack_guard_gap=100
Return 1 to indicate that the boot option has been handled.
Note that there is no warning message if someone enters:
stack_guard_gap=anything_invalid
and 'val' and stack_guard_gap are both set to 0 due to the use of
simple_strtoul(). This could be improved by using kstrtoxxx() and
checking for an error.
It appears that having stack_guard_gap == 0 is valid (if unexpected) since
using "stack_guard_gap=0" on the kernel command line does that.
Link: https://lkml.kernel.org/r/20220222005817.11087-1-rdunlap@infradead.org
Link: lore.kernel.org/r/64644a2f-4a20-bab3-1e15-3b2cdd0defe3@omprussia.ru Fixes: b6107de1ad627 ("mm: larger stack guard gap, between vmas") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: Igor Zhbanov <i.zhbanov@omprussia.ru> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Tue, 22 Mar 2022 21:42:21 +0000 (14:42 -0700)]
mm: change zap_details.zap_mapping into even_cows
Currently we have a zap_mapping pointer maintained in zap_details, when
it is specified we only want to zap the pages that have the same mapping
as the one the caller has specified.
But what we want to do is actually simpler: we want to skip zapping
private (COW-ed) pages in some cases. We can refer to
unmap_mapping_pages() callers where we could have passed in different
even_cows values. The other user is unmap_mapping_folio() where we
always want to skip private pages.
According to Hugh, we used a mapping pointer for historical reasons, as
explained here:
Which raises the question again of why I did not just use a boolean flag
there originally: aah, I think I've found why. In those days there was a
horrible "optimization", for better performance on some benchmark I guess,
which when you read from /dev/zero into a private mapping, would map the zero
page there (look up read_zero_pagealigned() and zeromap_page_range() if you
dare). So there was another category of page to be skipped along with the
anon COWs, and I didn't want multiple tests in the zap loop, so checking
check_mapping against page->mapping did both. I think nowadays you could do
it by checking for PageAnon page (or genuine swap entry) instead.
This patch replaces the zap_details.zap_mapping pointer with the even_cows
boolean, then we check it against PageAnon.
Link: https://lkml.kernel.org/r/20220216094810.60572-4-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Hugh Dickins <hughd@google.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Tue, 22 Mar 2022 21:42:15 +0000 (14:42 -0700)]
mm: don't skip swap entry even if zap_details specified
Patch series "mm: Rework zap ptes on swap entries", v5.
Patch 1 should fix a long standing bug for zap_pte_range() on
zap_details usage. The risk is we could have some swap entries skipped
while we should have zapped them.
Migration entries are not the major concern because file backed memory
is always zapped in the pattern "first time without page lock, then
re-zap with page lock" hence the 2nd zap will always make sure all
migration entries are already recovered.
However there can be issues with real swap entries getting skipped
erroneously. There's a reproducer provided in the commit message of patch
1 for that.
Patches 2-4 are cleanups that are based on patch 1. After the whole
patchset is applied, we should have a very clean view of zap_pte_range().
Only patch 1 needs to be backported to stable if necessary.
This patch (of 4):
The "details" pointer shouldn't be the token to decide whether we should
skip swap entries.
For example, when the callers specified details->zap_mapping==NULL, it
means the user wants to zap all the pages (including COWed pages), then
we need to look into swap entries because there can be private COWed
pages that were swapped out.
Skipping some swap entries when details is non-NULL may lead to wrongly
leaving some of the swap entries while we should have zapped them.
/* Write private page, swap it out */
buffer[page_size] = 1;
madvise(buffer, page_size * 2, MADV_PAGEOUT);
/* This should drop private buffer[page_size] already */
ret = ftruncate(shmem_fd, page_size);
assert(ret == 0);
/* Recover the size */
ret = ftruncate(shmem_fd, page_size * 2);
assert(ret == 0);
/* Re-read the data, it should be all zero */
val = buffer[page_size];
if (val == 0)
printf("Good\n");
else
printf("BUG\n");
}
===8<===
We don't need to touch up the pmd path, because pmd never had an issue with
swap entries. For example, shmem pmd migration will always be split into
pte level, and the same applies to swapping of anonymous memory.
Add another helper should_zap_cows() so that we can also check whether we
should zap private mappings when there's no page pointer specified.
This patch drops that trick, so we handle swap ptes coherently. Meanwhile
we should do the same check upon migration entry, hwpoison entry and
genuine swap entries too.
To be explicit, we should still remember to keep the private entries if
even_cows==false, and always zap them when even_cows==true.
The issue seems to exist starting from the initial commit of git.
Muchun Song [Tue, 22 Mar 2022 21:42:08 +0000 (14:42 -0700)]
mm: userfaultfd: fix missing cache flush in mcopy_atomic_pte() and __mcopy_atomic()
userfaultfd calls mcopy_atomic_pte() and __mcopy_atomic() which do not
do any cache flushing for the target page. Then the target page will be
mapped to the user space with a different address (user address), which
might have an alias issue with the kernel address used to copy the data
from the user to. Fix this by inserting flush_dcache_page() after
copy_from_user() succeeds.
Link: https://lkml.kernel.org/r/20220210123058.79206-7-songmuchun@bytedance.com Fixes: 657ea1bbd605 ("userfaultfd: avoid mmap_sem read recursion in mcopy_atomic") Fixes: 19f2658742d6 ("userfaultfd: mcopy_atomic|mfill_zeropage: UFFDIO_COPY|UFFDIO_ZEROPAGE preparation") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:42:05 +0000 (14:42 -0700)]
mm: shmem: fix missing cache flush in shmem_mfill_atomic_pte()
userfaultfd calls shmem_mfill_atomic_pte() which does not do any cache
flushing for the target page. Then the target page will be mapped to
the user space with a different address (user address), which might have
an alias issue with the kernel address used to copy the data from the
user to. Insert flush_dcache_page() in non-zero-page case. And replace
clear_highpage() with clear_user_highpage() which already considers the
cache maintenance.
Link: https://lkml.kernel.org/r/20220210123058.79206-6-songmuchun@bytedance.com Fixes: 46cbc25f174c ("userfaultfd: shmem: add shmem_mfill_zeropage_pte for userfaultfd support") Fixes: 205cbfcfa366 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:42:02 +0000 (14:42 -0700)]
mm: hugetlb: fix missing cache flush in hugetlb_mcopy_atomic_pte()
folio_copy() will copy the data from one page to the target page, then
the target page will be mapped to the user space address, which might
have an alias issue with the kernel address used to copy the data from
the page to. There are 2 ways to fix this issue.
1) insert flush_dcache_page() after folio_copy().
2) replace folio_copy() with copy_user_huge_page() which already
considers the cache maintenance.
We chose option 2) to fix the issue since architectures can optimize this
situation. It also makes backports easier.
Link: https://lkml.kernel.org/r/20220210123058.79206-5-songmuchun@bytedance.com Fixes: 42c135c8c354 ("mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:41:59 +0000 (14:41 -0700)]
mm: hugetlb: fix missing cache flush in copy_huge_page_from_user()
userfaultfd calls copy_huge_page_from_user(), which does not do any cache
flushing for the target page. The target page is then mapped into user
space at a different address (the user address), which might alias the
kernel address that was used to copy the data in from user space.
Fix this issue by flushing dcache in copy_huge_page_from_user().
Link: https://lkml.kernel.org/r/20220210123058.79206-4-songmuchun@bytedance.com Fixes: 958a0e4f73c1 ("userfaultfd: hugetlbfs: add copy_huge_page_from_user for hugetlb userfaultfd support") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:41:56 +0000 (14:41 -0700)]
mm: fix missing cache flush for all tail pages of compound page
The D-cache maintenance inside move_to_new_page() only considers one
page, so there is still a D-cache maintenance issue for the tail pages of
a compound page (e.g. THP or HugeTLB).
THP migration is only enabled on x86_64, arm64 and powerpc. Of these,
arm64 and powerpc need to maintain consistency between I-cache and
D-cache, which depends on flush_dcache_page().
But there are no issues on arm64 and powerpc, since they already consider
compound pages in their icache flush function.
HugeTLB migration is enabled on arm, arm64, mips, parisc, powerpc,
riscv, s390 and sh. arm handles the compound page cache flush in
flush_dcache_page(), but most of the others do not.
In theory, the issue exists on many architectures. Fix this without
using flush_dcache_folio(), since it is not backportable.
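As a rough userspace illustration of the problem described above, the
sketch below models a compound page as an array of toy pages and compares
flushing only the head page with flushing every subpage. The struct and
all toy_* names are hypothetical stand-ins, and the per-subpage loop is
only one plausible reading of the fix (calling flush_dcache_page() once
per subpage), not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page: it only tracks whether the
 * page still has dirty D-cache lines after a kernel-side copy. */
struct toy_page {
    bool dcache_dirty;
};

/* Stand-in for flush_dcache_page(): cleans exactly one page. */
static void toy_flush_dcache_page(struct toy_page *page)
{
    page->dcache_dirty = false;
}

/* Behaviour described as buggy: only the head page is flushed. */
static void flush_head_only(struct toy_page *head, size_t nr_pages)
{
    (void)nr_pages;
    toy_flush_dcache_page(head);
}

/* Assumed fixed behaviour: walk every subpage of the compound page. */
static void flush_all_subpages(struct toy_page *head, size_t nr_pages)
{
    for (size_t i = 0; i < nr_pages; i++)
        toy_flush_dcache_page(head + i);
}

/* Count the subpages that could still expose stale data to an alias. */
static size_t count_dirty(const struct toy_page *head, size_t nr_pages)
{
    size_t dirty = 0;

    assert(head != NULL);
    for (size_t i = 0; i < nr_pages; i++)
        if (head[i].dcache_dirty)
            dirty++;
    return dirty;
}
```

In this model, flushing only the head of a four-page compound page
leaves three dirty tail pages, while the per-subpage loop leaves none.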
Link: https://lkml.kernel.org/r/20220210123058.79206-3-songmuchun@bytedance.com Fixes: d244743a82a7 ("hugetlb: hugepage migration core") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Rientjes <rientjes@google.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Peter Xu <peterx@redhat.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:41:53 +0000 (14:41 -0700)]
mm: thp: fix wrong cache flush in remove_migration_pmd()
Patch series "Fix some cache flush bugs", v5.
This series focuses on fixing cache maintenance.
This patch (of 7):
flush_cache_range() is only justified if the page is already placed in
the process page table, but here that is done right after the
flush_cache_range() call. So using this interface is wrong. There is
also no need to invalidate the cache, since the mapping was non-present
before remove_migration_pmd(). So just remove the call.
Link: https://lkml.kernel.org/r/20220210123058.79206-1-songmuchun@bytedance.com Link: https://lkml.kernel.org/r/20220210123058.79206-2-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: David Rientjes <rientjes@google.com> Cc: Lars Persson <lars.persson@axis.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Stafford Horne [Tue, 22 Mar 2022 21:41:50 +0000 (14:41 -0700)]
mm: remove mmu_gathers storage from remaining architectures
Originally the mmu_gathers were removed in commit 0b0fae009ce3 ("mm: now
that all old mmu_gather code is gone, remove the storage"). However,
the openrisc and hexagon architectures were merged around the same time
and their mmu_gathers storage was not removed.
This patch removes it from openrisc, hexagon and nds32.
Noticed while cleaning up this warning:
arch/openrisc/mm/init.c:41:1: warning: symbol 'mmu_gathers' was not declared. Should it be static?
Link: https://lkml.kernel.org/r/20220205141956.3315419-1-shorne@gmail.com Signed-off-by: Stafford Horne <shorne@gmail.com> Acked-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Brian Cain <bcain@codeaurora.org> Cc: Nick Hu <nickhu@andestech.com> Cc: Greentime Hu <green.hu@gmail.com> Cc: Vincent Chen <deanbo422@gmail.com> Cc: Jonas Bonn <jonas@southpole.se> Cc: Stefan Kristiansson <stefan.kristiansson@saunalahti.fi> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: David Hildenbrand <david@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: merge pte_mkhuge() call into arch_make_huge_pte()
Each call to pte_mkhuge() is invariably followed by
arch_make_huge_pte(). Instead, arch_make_huge_pte() can call
pte_mkhuge() at the beginning. This updates the generic fallback stub for
arch_make_huge_pte() and the available platform definitions. This makes
huge pte creation much cleaner and easier to follow.
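The refactoring described above can be sketched in userspace C. The
toy_pte_t type, both bit values and every *_old/*_new name are
hypothetical and chosen only for illustration; real PTE layouts and
arch_make_huge_pte() signatures are architecture specific.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical PTE representation; real layouts are arch specific. */
typedef struct { uint64_t val; } toy_pte_t;

#define TOY_PTE_HUGE (UINT64_C(1) << 0) /* made-up "huge" bit */
#define TOY_PTE_ARCH (UINT64_C(1) << 1) /* made-up arch-specific bit */

static toy_pte_t toy_pte_mkhuge(toy_pte_t pte)
{
    pte.val |= TOY_PTE_HUGE;
    return pte;
}

/* Before: the arch hook only applies arch-specific tweaks, so every
 * caller has to call pte_mkhuge() first. */
static toy_pte_t arch_make_huge_pte_old(toy_pte_t pte)
{
    pte.val |= TOY_PTE_ARCH;
    return pte;
}

static toy_pte_t make_huge_pte_old(toy_pte_t pte)
{
    pte = toy_pte_mkhuge(pte);
    return arch_make_huge_pte_old(pte);
}

/* After: the arch hook marks the entry huge at the beginning, so the
 * caller makes a single call. */
static toy_pte_t arch_make_huge_pte_new(toy_pte_t pte)
{
    pte = toy_pte_mkhuge(pte);
    pte.val |= TOY_PTE_ARCH;
    return pte;
}

static toy_pte_t make_huge_pte_new(toy_pte_t pte)
{
    return arch_make_huge_pte_new(pte);
}

/* Both call sequences must produce the same entry. */
static void toy_check_equivalent(toy_pte_t pte)
{
    assert(make_huge_pte_old(pte).val == make_huge_pte_new(pte).val);
}
```

The point of the sketch is that the resulting entry is unchanged and
only the place where the huge bit is set moves into the arch hook.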
Link: https://lkml.kernel.org/r/1643860669-26307-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Paul Mackerras <paulus@samba.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Guillaume Tucker [Tue, 22 Mar 2022 21:41:44 +0000 (14:41 -0700)]
selftests, x86: fix how check_cc.sh is being invoked
The $(CC) variable used in Makefiles could contain several arguments
such as "ccache gcc". These need to be passed as a single string to
check_cc.sh, otherwise only the first argument will be used as the
compiler command. Without quotes, the $(CC) variable is passed as
distinct arguments which causes the script to fail to build trivial
programs.
Fix this by adding quotes around $(CC) when calling check_cc.sh to pass
the whole string as a single argument to the script even if it has
several words such as "ccache gcc".
Vasily Averin [Tue, 22 Mar 2022 21:41:41 +0000 (14:41 -0700)]
memcg: enable accounting for tty-related objects
At each login the user forces the kernel to create a new terminal and
allocate up to ~1KB of memory for the tty-related structures.
By default it is allowed to create up to 4096 ptys, with 1024 reserved
for the initial mount namespace only, and the settings are controlled by
the host admin.
However, this default is not enough for hosters with thousands of
containers per node. The host admin can be forced to increase it up to
NR_UNIX98_PTY_MAX = 1<<20.
By default a container is restricted by pty mount_opt.max = 1024, but an
admin inside the container can change it via remount. As a result, one
container can consume almost all allowed ptys and allocate up to 1GB of
unaccounted memory.
This is not enough per se to trigger OOM on the host. However, it allows
the container to significantly exceed its assigned memcg limit and leads
to trouble on an over-committed node.
It makes sense to account for these objects to restrict the host's
memory consumption from inside the memcg-limited container.
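The worst-case figure quoted above follows from simple arithmetic, which
the sketch below spells out. The 1 KiB per-pty cost is the rough
estimate from this message rather than a measured value, and the helper
name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Upper bound on ptys quoted in the commit message. */
#define NR_UNIX98_PTY_MAX (UINT64_C(1) << 20)

/* Assumption: ~1 KiB of tty-related allocations per pty, taken from
 * the estimate in the commit message. */
#define TTY_BYTES_PER_PTY UINT64_C(1024)

/* Upper bound on unaccounted tty memory for a given number of ptys. */
static uint64_t worst_case_tty_bytes(uint64_t nr_ptys)
{
    assert(nr_ptys <= NR_UNIX98_PTY_MAX);
    return nr_ptys * TTY_BYTES_PER_PTY;
}
```

With the per-container default of 1024 ptys this gives 1 MiB, while
1<<20 ptys gives 1<<30 bytes, which is the 1GB mentioned above.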
Link: https://lkml.kernel.org/r/5d4bca06-7d4f-a905-e518-12981ebca1b3@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Slaby <jirislaby@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:41:38 +0000 (14:41 -0700)]
mm: memcontrol: rename memcg_cache_id to memcg_kmem_id
The memcg_cache_id() introduced by commit 860ac54f4107 ("slab/slub:
consider a memcg parameter in kmem_create_cache") was used to index into
the kmem_cache->memcg_params->memcg_caches array. Since
kmem_cache->memcg_params.memcg_caches has been removed by commit
967868c2f06d ("mm: memcg/slab: use a single set of kmem_caches for all
accounted allocations"), the name no longer needs to refer to caches.
Rename it to memcg_kmem_id, which reflects that it is kmem related.
Link: https://lkml.kernel.org/r/20220228122126.37293-17-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Tue, 22 Mar 2022 21:41:31 +0000 (14:41 -0700)]
mm: memcontrol: fix cannot alloc the maximum memcg ID
The upper bound passed to idr_alloc() is exclusive, so the @max ID
itself is never allocated. In the current implementation, the maximum
memcg ID is therefore 65534 instead of 65535. This looks like a bug, so
fix it.
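The off-by-one can be reproduced with a toy allocator whose end bound is
exclusive. This is a userspace sketch and not the kernel IDR: the
toy_idr type and both helpers are hypothetical, and passing the maximum
plus one as the bound is only an assumption about what the fix looks
like.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_ID_MAX 65535 /* maximum memcg ID named in the commit message */

/* Toy ID table: one "in use" flag per possible ID. */
struct toy_idr {
    bool used[TOY_ID_MAX + 1];
};

/* Return the lowest free ID in [start, end), or -1 if none is left.
 * The end bound is exclusive, as described for idr_alloc() above. */
static int toy_idr_alloc(struct toy_idr *idr, int start, int end)
{
    assert(start >= 0 && end <= TOY_ID_MAX + 1);
    for (int id = start; id < end; id++) {
        if (!idr->used[id]) {
            idr->used[id] = true;
            return id;
        }
    }
    return -1;
}

/* Allocate until the range is exhausted and report the largest ID
 * that was handed out. */
static int toy_largest_id(int start, int end)
{
    static struct toy_idr idr;
    int id, largest = -1;

    memset(&idr, 0, sizeof(idr));
    while ((id = toy_idr_alloc(&idr, start, end)) >= 0) {
        largest = id;
        start = id + 1; /* lower IDs are taken; skip rescanning them */
    }
    return largest;
}
```

Here toy_largest_id(1, TOY_ID_MAX) stops at 65534, while
toy_largest_id(1, TOY_ID_MAX + 1) reaches 65535.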
Link: https://lkml.kernel.org/r/20220228122126.37293-15-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Alex Shi <alexs@kernel.org> Cc: Anna Schumaker <Anna.Schumaker@Netapp.com> Cc: Chao Yu <chao@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Fam Zheng <fam.zheng@bytedance.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kari Argillander <kari.argillander@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qi Zheng <zhengqi.arch@bytedance.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeelb@google.com> Cc: Theodore Ts'o <tytso@mit.edu> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>