Baolin Wang [Fri, 5 Nov 2021 20:41:55 +0000 (13:41 -0700)]
hugetlb: replace the obsolete hugetlb_instantiation_mutex in the comments
After commit 322f8dae02b8 ("mm, hugetlb: improve page-fault
scalability"), the hugetlb_instantiation_mutex lock had been replaced by
hugetlb_fault_mutex_table to serializes faults on the same logical page.
Thus update the obsolete hugetlb_instantiation_mutex related comments.
Baolin Wang [Fri, 5 Nov 2021 20:41:46 +0000 (13:41 -0700)]
hugetlb: support node specified when using cma for gigantic hugepages
Now the size of CMA area for gigantic hugepages runtime allocation is
balanced for all online nodes, but we also want to specify the size of
CMA per-node, or only one node in some cases, which are similar with
patch [1].
For example, on some multi-nodes systems, each node's memory can be
different, allocating the same size of CMA for each node is not suitable
for the low-memory nodes. Meanwhile some workloads like DPDK mentioned
by Zhenguo in patch [1] only need hugepages in one node.
On the other hand, we have some machines with multiple types of memory,
like DRAM and PMEM (persistent memory). On this system, we may want to
specify all the hugepages only on DRAM node, or specify the proportion
of DRAM node and PMEM node, to tuning the performance of the workloads.
Thus this patch adds node format for 'hugetlb_cma' parameter to support
specifying the size of CMA per-node. An example is as follows:
hugetlb_cma=0:5G,2:5G
which means allocating 5G size of CMA area on node 0 and node 2
respectively. And the users should use the node specific sysfs file to
allocate the gigantic hugepages if specified the CMA size on that node.
Mina Almasry [Fri, 5 Nov 2021 20:41:40 +0000 (13:41 -0700)]
mm, hugepages: add mremap() support for hugepage backed vma
Support mremap() for hugepage backed vma segment by simply repositioning
page table entries. The page table entries are repositioned to the new
virtual address on mremap().
Hugetlb mremap() support is of course generic; my motivating use case is
a library (hugepage_text), which reloads the ELF text of executables in
hugepages. This significantly increases the execution performance of
said executables.
Restrict the mremap operation on hugepages to up to the size of the
original mapping as the underlying hugetlb reservation is not yet
capable of handling remapping to a larger size.
During the mremap() operation we detect pmd_share'd mappings and we
unshare those during the mremap(). On access and fault the sharing is
established again.
Link: https://lkml.kernel.org/r/20211013195825.3058275-1-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Ken Chen <kenchen@google.com> Cc: Chris Kennelly <ckennelly@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Kirill Shutemov <kirill@shutemov.name> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 5 Nov 2021 20:41:33 +0000 (13:41 -0700)]
hugetlb: add hugetlb demote page support
Demote page functionality will split a huge page into a number of huge
pages of a smaller size. For example, on x86 a 1GB huge page can be
demoted into 512 2M huge pages. Demotion is done 'in place' by simply
splitting the huge page.
Added '*_for_demote' wrappers for remove_hugetlb_page,
destroy_compound_hugetlb_page and prep_compound_gigantic_page for use by
demote code.
[mike.kravetz@oracle.com: v4] Link: https://lkml.kernel.org/r/6ca29b8e-527c-d6ec-900e-e6a43e4f8b73@oracle.com Link: https://lkml.kernel.org/r/20211007181918.136982-6-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 5 Nov 2021 20:41:30 +0000 (13:41 -0700)]
hugetlb: add demote bool to gigantic page routines
The routines remove_hugetlb_page and destroy_compound_gigantic_page will
remove a gigantic page and make the set of base pages ready to be
returned to a lower level allocator. In the process of doing this, they
make all base pages reference counted.
The routine prep_compound_gigantic_page creates a gigantic page from a
set of base pages. It assumes that all these base pages are reference
counted.
During demotion, a gigantic page will be split into huge pages of a
smaller size. This logically involves use of the routines,
remove_hugetlb_page, and destroy_compound_gigantic_page followed by
prep_compound*_page for each smaller huge page.
When pages are reference counted (ref count >= 0), additional
speculative ref counts could be taken as described in previous commits
[1] and [2]. This could result in errors while demoting a huge page.
Quite a bit of code would need to be created to handle all possible
issues.
Instead of dealing with the possibility of speculative ref counts, avoid
the possibility by keeping ref counts at zero during the demote process.
Add a boolean 'demote' to the routines remove_hugetlb_page,
destroy_compound_gigantic_page and prep_compound_gigantic_page. If the
boolean is set, the remove and destroy routines will not reference count
pages and the prep routine will not expect reference counted pages.
'*_for_demote' wrappers of the routines will be added in a subsequent
patch where this functionality is used.
Link: https://lkml.kernel.org/r/20211007181918.136982-5-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 5 Nov 2021 20:41:27 +0000 (13:41 -0700)]
hugetlb: be sure to free demoted CMA pages to CMA
When huge page demotion is fully implemented, gigantic pages can be
demoted to a smaller huge page size. For example, on x86 a 1G page can
be demoted to 512 2M pages. However, gigantic pages can potentially be
allocated from CMA. If a gigantic page which was allocated from CMA is
demoted, the corresponding demoted pages needs to be returned to CMA.
Use the new interface cma_pages_valid() to determine if a non-gigantic
hugetlb page should be freed to CMA. Also, clear mapping field of these
pages as expected by cma_release.
This also requires a change to CMA region creation for gigantic pages.
CMA uses a per-region bit map to track allocations. When setting up the
region, you specify how many pages each bit represents. Currently, only
gigantic pages are allocated/freed from CMA so the region is set up such
that one bit represents a gigantic page size allocation.
With demote, a gigantic page (allocation) could be split into smaller
size pages. And, these smaller size pages will be freed to CMA. So,
since the per-region bit map needs to be set up to represent the
smallest allocation/free size, it now needs to be set to the smallest
huge page size which can be freed to CMA.
Unfortunately, we set up the CMA region for huge pages before we set up
huge pages sizes (hstates). So, technically we do not know the smallest
huge page size as this can change via command line options and
architecture specific code. Therefore, at region setup time we use
HUGETLB_PAGE_ORDER as the smallest possible huge page size that can be
given back to CMA. It is possible that this value is sub-optimal for
some architectures/config options. If needed, this can be addressed in
follow on work.
Link: https://lkml.kernel.org/r/20211007181918.136982-4-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 5 Nov 2021 20:41:23 +0000 (13:41 -0700)]
mm/cma: add cma_pages_valid to determine if pages are in CMA
Add new interface cma_pages_valid() which indicates if the specified
pages are part of a CMA region. This interface will be used in a
subsequent patch by hugetlb code.
In order to keep the same amount of DEBUG information, a pr_debug() call
was added to cma_pages_valid(). In the case where the page passed to
cma_release is not in cma region, the debug message will be printed from
cma_pages_valid as opposed to cma_release.
Link: https://lkml.kernel.org/r/20211007181918.136982-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: Nghia Le <nghialm78@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Fri, 5 Nov 2021 20:41:20 +0000 (13:41 -0700)]
hugetlb: add demote hugetlb page sysfs interfaces
Patch series "hugetlb: add demote/split page functionality", v4.
The concurrent use of multiple hugetlb page sizes on a single system is
becoming more common. One of the reasons is better TLB support for
gigantic page sizes on x86 hardware. In addition, hugetlb pages are
being used to back VMs in hosting environments.
When using hugetlb pages to back VMs, it is often desirable to
preallocate hugetlb pools. This avoids the delay and uncertainty of
allocating hugetlb pages at VM startup. In addition, preallocating huge
pages minimizes the issue of memory fragmentation that increases the
longer the system is up and running.
In such environments, a combination of larger and smaller hugetlb pages
are preallocated in anticipation of backing VMs of various sizes. Over
time, the preallocated pool of smaller hugetlb pages may become depleted
while larger hugetlb pages still remain. In such situations, it is
desirable to convert larger hugetlb pages to smaller hugetlb pages.
Converting larger to smaller hugetlb pages can be accomplished today by
first freeing the larger page to the buddy allocator and then allocating
the smaller pages. For example, to convert 50 GB pages on x86:
On an idle system this operation is fairly reliable and results are as
expected. The number of 2MB pages is increased as expected and the time
of the operation is a second or two.
However, when there is activity on the system the following issues
arise:
1) This process can take quite some time, especially if allocation of
the smaller pages is not immediate and requires migration/compaction.
2) There is no guarantee that the total size of smaller pages allocated
will match the size of the larger page which was freed. This is
because the area freed by the larger page could quickly be
fragmented.
In a test environment with a load that continually fills the page cache
with clean pages, results such as the following can be observed:
Unexpected number of 2MB pages allocated: Expected 25600, have 19944
real 0m42.092s
user 0m0.008s
sys 0m41.467s
To address these issues, introduce the concept of hugetlb page demotion.
Demotion provides a means of 'in place' splitting of a hugetlb page to
pages of a smaller size. This avoids freeing pages to buddy and then
trying to allocate from buddy.
Page demotion is controlled via sysfs files that reside in the per-hugetlb
page size and per node directories.
- demote_size
Target page size for demotion, a smaller huge page size. File
can be written to chose a smaller huge page size if multiple are
available.
- demote
Writable number of hugetlb pages to be demoted
Only hugetlb pages which are free at the time of the request can be
demoted. Demotion does not add to the complexity of surplus pages and
honors reserved huge pages. Therefore, when a value is written to the
sysfs demote file, that value is only the maximum number of pages which
will be demoted. It is possible fewer will actually be demoted. The
recently introduced per-hstate mutex is used to synchronize demote
operations with other operations that modify hugetlb pools.
Real world use cases
--------------------
The above scenario describes a real world use case where hugetlb pages
are used to back VMs on x86. Both issues of long allocation times and
not necessarily getting the expected number of smaller huge pages after
a free and allocate cycle have been experienced. The occurrence of
these issues is dependent on other activity within the host and can not
be predicted.
This patch (of 5):
Two new sysfs files are added to demote hugtlb pages. These files are
both per-hugetlb page size and per node. Files are:
demote_size - The size in Kb that pages are demoted to. (read-write)
demote - The number of huge pages to demote. (write-only)
By default, demote_size is the next smallest huge page size. Valid huge
page sizes less than huge page size may be written to this file. When
huge pages are demoted, they are demoted to this size.
Writing a value to demote will result in an attempt to demote that
number of hugetlb pages to an appropriate number of demote_size pages.
NOTE: Demote interfaces are only provided for huge page sizes if there
is a smaller target demote huge page size. For example, on x86 1GB huge
pages will have demote interfaces. 2MB huge pages will not have demote
interfaces.
This patch does not provide full demote functionality. It only provides
the sysfs interfaces.
It also provides documentation for the new interfaces.
[mike.kravetz@oracle.com: n_mask initialization does not need to be protected by the mutex] Link: https://lkml.kernel.org/r/0530e4ef-2492-5186-f919-5db68edea654@oracle.com Link: https://lkml.kernel.org/r/20211007181918.136982-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev> Cc: David Rientjes <rientjes@google.com> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Nghia Le <nghialm78@gmail.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 5 Nov 2021 20:41:17 +0000 (13:41 -0700)]
mm/hugetlb: drop __unmap_hugepage_range definition from hugetlb.h
Remove __unmap_hugepage_range() from the header file, because it is only
used in hugetlb.c.
Link: https://lkml.kernel.org/r/20210917165108.9341-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 5 Nov 2021 20:41:14 +0000 (13:41 -0700)]
mm: hwpoison: handle non-anonymous THP correctly
Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
support for tmpfs and read-only file cache has been added. They could
be offlined by split THP, just like anonymous THP.
Link: https://lkml.kernel.org/r/20211020210755.23964-7-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 5 Nov 2021 20:41:10 +0000 (13:41 -0700)]
mm: shmem: don't truncate page if memory failure happens
The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean. If the page is dirty the later access
will get the obsolete data from disk without any notification to the
users. This may cause silent data loss. It is even worse for shmem
since shmem is in-memory filesystem, truncating page cache means
discarding data blocks. The later read would return all zero.
The right approach is to keep the corrupted page in page cache, any
later access would return error for syscalls or SIGBUS for page fault,
until the file is truncated, hole punched or removed. The regular
storage backed filesystems would be more complicated so this patch is
focused on shmem. This also unblock the support for soft offlining
shmem THP.
[arnd@arndb.de: fix uninitialized variable use in me_pagecache_clean()] Link: https://lkml.kernel.org/r/20211022064748.4173718-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20211020210755.23964-6-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 5 Nov 2021 20:41:07 +0000 (13:41 -0700)]
mm: hwpoison: refactor refcount check handling
Memory failure will report failure if the page still has extra pinned
refcount other than from hwpoison after the handler is done. Actually
the check is not necessary for all handlers, so move the check into
specific handlers. This would make the following keeping shmem page in
page cache patch easier.
There may be expected extra pin for some cases, for example, when the
page is dirty and in swapcache.
Link: https://lkml.kernel.org/r/20211020210755.23964-5-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Yang Shi [Fri, 5 Nov 2021 20:41:04 +0000 (13:41 -0700)]
mm: filemap: coding style cleanup for filemap_map_pmd()
Patch series "Solve silent data loss caused by poisoned page cache (shmem/tmpfs)", v5.
When discussing the patch that splits page cache THP in order to offline
the poisoned page, Noaya mentioned there is a bigger problem [1] that
prevents this from working since the page cache page will be truncated
if uncorrectable errors happen. By looking this deeper it turns out
this approach (truncating poisoned page) may incur silent data loss for
all non-readonly filesystems if the page is dirty. It may be worse for
in-memory filesystem, e.g. shmem/tmpfs since the data blocks are
actually gone.
To solve this problem we could keep the poisoned dirty page in page
cache then notify the users on any later access, e.g. page fault,
read/write, etc. The clean page could be truncated as is since they can
be reread from disk later on.
The consequence is the filesystems may find poisoned page and manipulate
it as healthy page since all the filesystems actually don't check if the
page is poisoned or not in all the relevant paths except page fault. In
general, we need make the filesystems be aware of poisoned page before
we could keep the poisoned page in page cache in order to solve the data
loss problem.
To make filesystems be aware of poisoned page we should consider:
- The page should be not written back: clearing dirty flag could
prevent from writeback.
- The page should not be dropped (it shows as a clean page) by drop
caches or other callers: the refcount pin from hwpoison could prevent
from invalidating (called by cache drop, inode cache shrinking, etc),
but it doesn't avoid invalidation in DIO path.
- The page should be able to get truncated/hole punched/unlinked: it
works as it is.
- Notify users when the page is accessed, e.g. read/write, page fault
and other paths (compression, encryption, etc).
The scope of the last one is huge since almost all filesystems need do
it once a page is returned from page cache lookup. There are a couple
of options to do it:
1. Check hwpoison flag for every path, the most straightforward way.
2. Return NULL for poisoned page from page cache lookup, the most
callsites check if NULL is returned, this should have least work I
think. But the error handling in filesystems just return -ENOMEM,
the error code will incur confusion to the users obviously.
3. To improve #2, we could return error pointer, e.g. ERR_PTR(-EIO),
but this will involve significant amount of code change as well
since all the paths need check if the pointer is ERR or not just
like option #1.
I did prototypes for both #1 and #3, but it seems #3 may require more
changes than #1. For #3 ERR_PTR will be returned so all the callers
need to check the return value otherwise invalid pointer may be
dereferenced, but not all callers really care about the content of the
page, for example, partial truncate which just sets the truncated range
in one page to 0. So for such paths it needs additional modification if
ERR_PTR is returned. And if the callers have their own way to handle
the problematic pages we need to add a new FGP flag to tell FGP
functions to return the pointer to the page.
It may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and it is very hard to debug. It seems
this problem had been slightly discussed before, but seems no action was
taken at that time. [2]
As the aforementioned investigation, it needs huge amount of work to
solve the potential data loss for all filesystems. But it is much
easier for in-memory filesystems and such filesystems actually suffer
more than others since even the data blocks are gone due to truncating.
So this patchset starts from shmem/tmpfs by taking option #1.
TODO:
* The unpoison has been broken since commit 1a1510866cbe ("mm,hwpoison: make
get_hwpoison_page() call get_any_page()"), and this patch series make
refcount check for unpoisoning shmem page fail.
* Expand to other filesystems. But I haven't heard feedback from filesystem
developers yet.
Patch breakdown:
Patch #1: cleanup, depended by patch #2
Patch #2: fix THP with hwpoisoned subpage(s) PMD map bug
Patch #3: coding style cleanup
Patch #4: refactor and preparation.
Patch #5: keep the poisoned page in page cache and handle such case for all
the paths.
Patch #6: the previous patches unblock page cache THP split, so this patch
add page cache THP split support.
This patch (of 4):
A minor cleanup to the indent.
Link: https://lkml.kernel.org/r/20211020210755.23964-1-shy828301@gmail.com Link: https://lkml.kernel.org/r/20211020210755.23964-4-shy828301@gmail.com Signed-off-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oscar Salvador <osalvador@suse.de> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: page_alloc: use migrate_disable() in drain_local_pages_wq()
drain_local_pages_wq() disables preemption to avoid CPU migration during
CPU hotplug and can't use cpus_read_lock().
Using migrate_disable() works here, too. The scheduler won't take the
CPU offline until the task left the migrate-disable section. The
problem with disabled preemption here is that drain_local_pages()
acquires locks which are turned into sleeping locks on PREEMPT_RT and
can't be acquired with disabled preemption.
Use migrate_disable() in drain_local_pages_wq().
Link: https://lkml.kernel.org/r/20211015210933.viw6rjvo64qtqxn4@linutronix.de Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: make generic arch_is_kernel_initmem_freed() do what it says
Commit d8b4fffadf8a ("locking/lockdep: check for freed initmem in
static_obj()") added arch_is_kernel_initmem_freed() which is supposed to
report whether an object is part of already freed init memory.
For the time being, the generic version of
arch_is_kernel_initmem_freed() always reports 'false', allthough
free_initmem() is generically called on all architectures.
Therefore, change the generic version of arch_is_kernel_initmem_freed()
to check whether free_initmem() has been called. If so, then check if a
given address falls into init memory.
To ease the use of system_state, move it out of line into its only
caller which is lockdep.c
Link: https://lkml.kernel.org/r/1d40783e676e07858be97d881f449ee7ea8adfb1.1633001016.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: create a new system state and fix core_kernel_text()
core_kernel_text() considers that until system_state in at least
SYSTEM_RUNNING, init memory is valid.
But init memory is freed a few lines before setting SYSTEM_RUNNING, so
we have a small period of time when core_kernel_text() is wrong.
Create an intermediate system state called SYSTEM_FREEING_INIT that is
set before starting freeing init memory, and use it in
core_kernel_text() to report init memory invalid earlier.
Link: https://lkml.kernel.org/r/9ecfdee7dd4d741d172cb93ff1d87f1c58127c9a.1633001016.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@ozlabs.org> Cc: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kswapd is frequently woken up due to the increase of
min/low/high_wmark_pages, printing watermark_boost can quickly locate
whether watermark_boost or _watermark[WMARK_MIN/LOW/HIGH] caused
min/low/high_wmark_pages to increase.
Feng Tang [Fri, 5 Nov 2021 20:40:34 +0000 (13:40 -0700)]
mm/page_alloc: detect allocation forbidden by cpuset and bail out early
There was a report that starting an Ubuntu in docker while using cpuset
to bind it to movable nodes (a node only has movable zone, like a node
for hotplug or a Persistent Memory node in normal usage) will fail due
to memory allocation failure, and then OOM is involved and many other
innocent processes got killed.
oom-kill:constraint=CONSTRAINT_CPUSET,nodemask=(null),cpuset=docker-9976a269caec812c134fa317f27487ee36e1129beba7278a463dd53e5fb9997b.scope,mems_allowed=4,global_oom,task_memcg=/system.slice/containerd.service,task=containerd,pid=4100,uid=0
Out of memory: Killed process 4100 (containerd) total-vm:4077036kB, anon-rss:51184kB, file-rss:26016kB, shmem-rss:0kB, UID:0 pgtables:676kB oom_score_adj:0
oom_reaper: reaped process 8248 (docker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 2054 (node_exporter), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 1452 (systemd-journal), now anon-rss:0kB, file-rss:8564kB, shmem-rss:4kB
oom_reaper: reaped process 2146 (munin-node), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
oom_reaper: reaped process 8291 (runc:[2:INIT]), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The reason is that in this case, the target cpuset nodes only have
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal) like
GFP_HIGHUSER, and the cpuset limit forbids the allocation, then
out-of-memory killing is involved even when normal nodes and movable
nodes both have many free memory.
The OOM killer cannot help to resolve the situation as there is no
usable memory for the request in the cpuset scope. The only reasonable
measure to take is to fail the allocation right away and have the caller
to deal with it.
So add a check for cases like this in the slowpath of allocation, and
bail out early returning NULL for the allocation.
As page allocation is one of the hottest path in kernel, this check will
hurt all users with sane cpuset configuration, add a static branch check
and detect the abnormal config in cpuset memory binding setup so that
the extra check cost in page allocation is not paid by everyone.
[thanks to Micho Hocko and David Rientjes for suggesting not handling
it inside OOM code, adding cpuset check, refining comments]
Link: https://lkml.kernel.org/r/1632481657-68112-1-git-send-email-feng.tang@intel.com Signed-off-by: Feng Tang <feng.tang@intel.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Fri, 5 Nov 2021 20:40:31 +0000 (13:40 -0700)]
mm/page_alloc.c: do not acquire zone lock in is_free_buddy_page()
Grabbing zone lock in is_free_buddy_page() gives a wrong sense of
safety, and has potential performance implications when zone is
experiencing lock contention.
In any case, if a caller needs a stable result, it should grab zone lock
before calling this function.
mm: move fold_vm_numa_events() to fix NUMA without SMP
If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
sh4-linux-gnu-ld: mm/vmstat.o: in function `vmstat_start': vmstat.c:(.text+0x97c): undefined reference to `fold_vm_numa_events'
sh4-linux-gnu-ld: drivers/base/node.o: in function `node_read_vmstat': node.c:(.text+0x140): undefined reference to `fold_vm_numa_events'
sh4-linux-gnu-ld: drivers/base/node.o: in function `node_read_numastat': node.c:(.text+0x1d0): undefined reference to `fold_vm_numa_events'
Fix this by moving fold_vm_numa_events() outside the SMP-only section.
Link: https://lkml.kernel.org/r/9d16ccdd9ef32803d7100c84f737de6a749314fb.1631781495.git.geert+renesas@glider.be Fixes: fab9377135475346 ("mm/vmstat: convert NUMA statistics to basic NUMA counters") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Gon Solo <gonsolo@gmail.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Rich Felker <dalias@libc.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.osdn.me> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: move node_reclaim_distance to fix NUMA without SMP
Patch series "Fix NUMA without SMP".
SuperH is the only architecture which still supports NUMA without SMP,
for good reasons (various memories scattered around the address space,
each with varying latencies).
This series fixes two build errors due to variables and functions used
by the NUMA code being provided by SMP-only source files or sections.
This patch (of 2):
If CONFIG_NUMA=y, but CONFIG_SMP=n (e.g. sh/migor_defconfig):
sh4-linux-gnu-ld: mm/page_alloc.o: in function `get_page_from_freelist':
page_alloc.c:(.text+0x2c24): undefined reference to `node_reclaim_distance'
Fix this by moving the declaration of node_reclaim_distance from an
SMP-only to a generic file.
Link: https://lkml.kernel.org/r/cover.1631781495.git.geert+renesas@glider.be Link: https://lkml.kernel.org/r/6432666a648dde85635341e6c918cee97c97d264.1631781495.git.geert+renesas@glider.be Fixes: c19e736a8625c493 ("sched/topology: Improve load balancing on AMD EPYC systems") Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Suggested-by: Matt Fleming <matt@codeblueprint.co.uk> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.osdn.me> Cc: Rich Felker <dalias@libc.org> Cc: Gon Solo <gonsolo@gmail.com> Cc: Geert Uytterhoeven <geert+renesas@glider.be> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/page_alloc: use accumulated load when building node fallback list
In build_zonelists(), when the fallback list is built for the nodes, the
node load gets reinitialized during each iteration. This results in
nodes with same distances occupying the same slot in different node
fallback lists rather than appearing in the intended round- robin
manner. This results in one node getting picked for allocation more
compared to other nodes with the same distance.
As an example, consider a 4 node system with the following distance
matrix.
In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
same order which results in more allocations getting satisfied from node
0 compared to node 1.
The effect of this on remote memory bandwidth as seen by stream
benchmark is shown below:
Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
----------------------------------------
BANDWIDTH (MB/s)
TEST Case 1 Case 2
----------------------------------------
COPY 57479.6 110791.8
SCALE 55372.9 105685.9
ADD 50460.6 96734.2
TRIADD 50397.6 97119.1
----------------------------------------
The bandwidth drop in Case 1 occurs because most of the allocations get
satisfied by node 0 as it appears first in the fallback order for both
nodes 2 and 3.
This can be fixed by accumulating the node load in build_zonelists()
rather than reinitializing it during each iteration. With this the
nodes with the same distance rightly get assigned in the round robin
manner.
In fact this was how it was originally until commit d17917436799
("change zonelist order: zonelist order selection logic") dropped the
load accumulation and resorted to initializing the load during each
iteration.
While zonelist ordering was removed by commit 3742e7e052db ("mm,
page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
accumulation in build_zonelists() remained. So essentially this patch
reverts back to the accumulated node load logic.
After this fix, the fallback order gets built like this:
Node Fallback list
------------------
0 0 1 2 3
1 1 0 3 2
2 2 3 0 1
3 3 2 1 0 <-- Note the change here
The bandwidth in Case 1 improves and matches Case 2 as shown below.
----------------------------------------
BANDWIDTH (MB/s)
TEST Case 1 Case 2
----------------------------------------
COPY 110438.9 110107.2
SCALE 105930.5 105817.5
ADD 97005.1 96159.8
TRIADD 97441.5 96757.1
----------------------------------------
The correctness of the fallback list generation has been verified for
the above node configuration where the node 3 starts as memory-less node
and comes up online only during memory hotplug.
[bharata@amd.com: Added changelog, review, test validation]
Bharata B Rao [Fri, 5 Nov 2021 20:40:18 +0000 (13:40 -0700)]
mm/page_alloc: print node fallback order
Patch series "Fix NUMA nodes fallback list ordering".
For a NUMA system that has multiple nodes at same distance from other
nodes, the fallback list generation prefers same node order for them
instead of round-robin thereby penalizing one node over others. This
series fixes it.
More description of the problem and the fix is present in the patch
description.
This patch (of 2):
Print information message about the allocation fallback order for each
NUMA node during boot.
No functional changes here. This makes it easier to illustrate the
problem in the node fallback list generation, which the next patch
fixes.
Miaohe Lin [Fri, 5 Nov 2021 20:40:15 +0000 (13:40 -0700)]
mm/page_alloc.c: avoid allocating highmem pages via alloc_pages_exact[_nid]
Don't use with __GFP_HIGHMEM because page_address() cannot represent
highmem pages without kmap(). Newly allocated pages would leak as
page_address() will return NULL for highmem pages here. But It works
now because the callers do not specify __GFP_HIGHMEM now.
Link: https://lkml.kernel.org/r/20210902121242.41607-6-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Fri, 5 Nov 2021 20:40:08 +0000 (13:40 -0700)]
mm/page_alloc.c: fix obsolete comment in free_pcppages_bulk()
The second two paragraphs about "all pages pinned" and pages_scanned is
obsolete. And There are PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP orders
in pcp. So the same order assumption is not held now.
Link: https://lkml.kernel.org/r/20210902121242.41607-4-linmiaohe@huawei.com Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: David Hildenbrand <david@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miaohe Lin [Fri, 5 Nov 2021 20:40:02 +0000 (13:40 -0700)]
mm/page_alloc.c: remove meaningless VM_BUG_ON() in pindex_to_order()
Patch series "Cleanups and fixup for page_alloc", v2.
This series contains cleanups to remove meaningless VM_BUG_ON(), use
helpers to simplify the code and remove obsolete comment. Also we avoid
allocating highmem pages via alloc_pages_exact[_nid]. More details can be
found in the respective changelogs.
This patch (of 5):
It's meaningless to VM_BUG_ON() order != pageblock_order just after
setting order to pageblock_order. Remove it.
Chen Wandun [Fri, 5 Nov 2021 20:39:53 +0000 (13:39 -0700)]
mm/vmalloc: introduce alloc_pages_bulk_array_mempolicy to accelerate memory allocation
Commit 01174ecdda59 ("mm/vmalloc: fix numa spreading for large hash
tables") can cause significant performance regressions in some
situations as Andrew mentioned in [1]. The main situation is vmalloc,
vmalloc will allocate pages with NUMA_NO_NODE by default, that will
result in alloc page one by one;
In order to solve this, __alloc_pages_bulk and mempolicy should be
considered at the same time.
1) If node is specified in memory allocation request, it will alloc all
pages by __alloc_pages_bulk.
2) If interleaving allocate memory, it will cauculate how many pages
should be allocated in each node, and use __alloc_pages_bulk to alloc
pages in each node.
Michal Hocko [Fri, 5 Nov 2021 20:39:50 +0000 (13:39 -0700)]
mm/vmalloc: be more explicit about supported gfp flags
The core of the vmalloc allocator __vmalloc_area_node doesn't say
anything about gfp mask argument. Not all gfp flags are supported
though. Be more explicit about constraints.
Link: https://lkml.kernel.org/r/20211020082545.4830-1-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Neil Brown <neilb@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Ilya Dryomov <idryomov@gmail.com> Cc: Jeff Layton <jlayton@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The vm area used in vm_area_register_early() has no kasan shadow memory,
Let's add a new kasan_populate_early_vm_area_shadow() function to
populate the vm area shadow memory to fix the issue.
Kefeng Wang [Fri, 5 Nov 2021 20:39:44 +0000 (13:39 -0700)]
arm64: support page mapping percpu first chunk allocator
Percpu embedded first chunk allocator is the firstly option, but it
could fails on ARM64, eg,
percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000
then we could get
WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838
and the system could not boot successfully.
Let's implement page mapping percpu first chunk allocator as a fallback
to the embedding allocator to increase the robustness of the system.
Kefeng Wang [Fri, 5 Nov 2021 20:39:41 +0000 (13:39 -0700)]
vmalloc: choose a better start address in vm_area_register_early()
Percpu embedded first chunk allocator is the firstly option, but it
could fail on ARM64, eg,
percpu: max_distance=0x5fcfdc640000 too large for vmalloc space 0x781fefff0000
percpu: max_distance=0x600000540000 too large for vmalloc space 0x7dffb7ff0000
percpu: max_distance=0x5fff9adb0000 too large for vmalloc space 0x5dffb7ff0000
then we could get to
WARNING: CPU: 15 PID: 461 at vmalloc.c:3087 pcpu_get_vm_areas+0x488/0x838
and the system cannot boot successfully.
Let's implement page mapping percpu first chunk allocator as a fallback
to the embedding allocator to increase the robustness of the system.
Also fix a crash when both NEED_PER_CPU_PAGE_FIRST_CHUNK and
KASAN_VMALLOC enabled.
Tested on ARM64 qemu with cmdline "percpu_alloc=page".
This patch (of 3):
There are some fixed locations in the vmalloc area be reserved in
ARM(see iotable_init()) and ARM64(see map_kernel()), but for
pcpu_page_first_chunk(), it calls vm_area_register_early() and choose
VMALLOC_START as the start address of vmap area which could be
conflicted with above address, then could trigger a BUG_ON in
vm_area_add_early().
Let's choose a suit start address by traversing the vmlist.
Vasily Averin [Fri, 5 Nov 2021 20:39:37 +0000 (13:39 -0700)]
vmalloc: back off when the current task is OOM-killed
Huge vmalloc allocation on heavy loaded node can lead to a global memory
shortage. Task called vmalloc can have worst badness and be selected by
OOM-killer, however taken fatal signal does not interrupt allocation
cycle. Vmalloc repeat page allocaions again and again, exacerbating the
crisis and consuming the memory freed up by another killed tasks.
After a successful completion of the allocation procedure, a fatal
signal will be processed and task will be destroyed finally. However it
may not release the consumed memory, since the allocated object may have
a lifetime unrelated to the completed task. In the worst case, this can
lead to the host will panic due to "Out of memory and no killable
processes..."
This patch allows OOM-killer to break vmalloc cycle, makes OOM more
effective and avoid host panic. It does not check oom condition
directly, however, and breaks page allocation cycle when fatal signal
was received.
This may trigger some hidden problems, when caller does not handle
vmalloc failures, or when rollaback after failed vmalloc calls own
vmallocs inside. However all of these scenarios are incorrect: vmalloc
does not guarantee successful allocation, it has never been called with
__GFP_NOFAIL and threfore either should not be used for any rollbacks or
should handle such errors correctly and not lead to critical failures.
Link: https://lkml.kernel.org/r/83efc664-3a65-2adb-d7c4-2885784cf109@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/vmalloc: check various alignments when debugging
Before we did not guarantee a free block with lowest start address for
allocations with alignment >= PAGE_SIZE. Because an alignment overhead
was included into a search length like below:
length = size + align - 1;
doing so we make sure that a bigger block would fit after applying an
alignment adjustment. Now there is no such limitation, i.e. any
alignment that user wants to apply will result to a lowest address of
returned free area.
Link: https://lkml.kernel.org/r/20211004142829.22222-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: David Hildenbrand <david@redhat.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Ping Fang <pifang@redhat.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/vmalloc: do not adjust the search size for alignment overhead
We used to include an alignment overhead into a search length, in that
case we guarantee that a found area will definitely fit after applying a
specific alignment that user specifies. From the other hand we do not
guarantee that an area has the lowest address if an alignment is >=
PAGE_SIZE.
It means that, when a user specifies a special alignment together with a
range that corresponds to an exact requested size then an allocation
will fail. This is what happens to KASAN, it wants the free block that
exactly matches a specified range during onlining memory banks:
Fix it by making sure that find_vmap_lowest_match() returns lowest start
address with any given alignment value, i.e. for alignments bigger then
PAGE_SIZE the algorithm rolls back toward parent nodes checking right
sub-trees if the most left free block did not fit due to alignment
overhead.
Link: https://lkml.kernel.org/r/20211004142829.22222-1-urezki@gmail.com Fixes: 862791f65069 ("mm/vmalloc.c: keep track of free blocks for vmap allocation") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reported-by: Ping Fang <pifang@redhat.com> Tested-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Hellwig <hch@infradead.org> Cc: Matthew Wilcox <willy@infradead.org> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Eric Dumazet [Fri, 5 Nov 2021 20:39:28 +0000 (13:39 -0700)]
mm/vmalloc: make sure to dump unpurged areas in /proc/vmallocinfo
If last va found in vmap_area_list does not have a vm pointer,
vmallocinfo.s_show() returns 0, and show_purge_info() is not called as
it should.
Link: https://lkml.kernel.org/r/20211001170815.73321-1-eric.dumazet@gmail.com Fixes: 0fc23c152b4f ("mm/vmalloc: do not keep unpurged areas in the busy tree") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Pengfei Li <lpf.vector@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Zijlstra [Fri, 5 Nov 2021 20:39:22 +0000 (13:39 -0700)]
mm/vmalloc: don't allow VM_NO_GUARD on vmap()
The vmalloc guard pages are added on top of each allocation, thereby
isolating any two allocations from one another. The top guard of the
lower allocation is the bottom guard guard of the higher allocation etc.
Therefore VM_NO_GUARD is dangerous; it breaks the basic premise of
isolating separate allocations.
There are only two in-tree users of this flag, neither of which use it
through the exported interface. Ensure it stays this way.
Link: https://lkml.kernel.org/r/YUMfdA36fuyZ+/xt@hirez.programming.kicks-ass.net Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Hildenbrand <david@redhat.com> Acked-by: Will Deacon <will@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Uladzislau Rezki <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasily Averin [Fri, 5 Nov 2021 20:39:19 +0000 (13:39 -0700)]
mm/vmalloc: repair warn_alloc()s in __vmalloc_area_node()
Commit f78c85662e21 ("mm: cleanup the gfp_mask handling in
__vmalloc_area_node") added __GFP_NOWARN to gfp_mask unconditionally
however it disabled all output inside warn_alloc() call. This patch
saves original gfp_mask and provides it to all warn_alloc() calls.
Link: https://lkml.kernel.org/r/f4f3187b-9684-e426-565d-827c2a9bbb0e@virtuozzo.com Fixes: f78c85662e21 ("mm: cleanup the gfp_mask handling in __vmalloc_area_node") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Uladzislau Rezki (Sony) <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lucas De Marchi [Fri, 5 Nov 2021 20:39:10 +0000 (13:39 -0700)]
include/linux/io-mapping.h: remove fallback for writecombine
The fallback was introduced in commit 387b86fc4843 ("io-mapping: Fixup
for different names of writecombine") to fix the build on microblaze.
5 years later, it seems all archs now provide a pgprot_writecombine(),
so just remove the other possible fallbacks. For microblaze,
pgprot_writecombine() is available since commit 46e5d4f2b300
("microblaze: Provide pgprot_device/writecombine macros for nommu").
This is build-tested on microblaze with a hack to always build
mm/io-mapping.o and without DIYing on an x86-only macro
(_PAGE_CACHE_MASK)
Link: https://lkml.kernel.org/r/20211020204838.1142908-1-lucas.demarchi@intel.com Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com> Cc: Chris Wilson <chris@chris-wilson.co.uk> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Safonov [Fri, 5 Nov 2021 20:39:06 +0000 (13:39 -0700)]
mm/mremap: don't account pages in vma_to_resize()
All this vm_unacct_memory(charged) dance seems to complicate the life
without a good reason. Furthermore, it seems not always done right on
error-pathes in mremap_to(). And worse than that: this `charged'
difference is sometimes double-accounted for growing MREMAP_DONTUNMAP
mremap()s in move_vma():
if (security_vm_enough_memory_mm(mm, new_len >> PAGE_SHIFT))
Let's not do this. Account memory in mremap() fast-path for growing
VMAs or in move_vma() for actually moving things. The same simpler way
as it's done by vm_stat_account(), but with a difference to call
security_vm_enough_memory_mm() before copying/adjusting VMA.
Originally noticed by Chen Wandun:
https://lkml.kernel.org/r/20210717101942.120607-1-chenwandun@huawei.com
Link: https://lkml.kernel.org/r/20210721131320.522061-1-dima@arista.com Fixes: 1a798f218ef0 ("mm/mremap: add MREMAP_DONTUNMAP to mremap()") Signed-off-by: Dmitry Safonov <dima@arista.com> Acked-by: Brian Geffon <bgeffon@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Chen Wandun <chenwandun@huawei.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Dave Jiang <dave.jiang@intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vishal Verma <vishal.l.verma@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Wei Yongjun <weiyongjun1@huawei.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Liu Song [Fri, 5 Nov 2021 20:39:03 +0000 (13:39 -0700)]
mm/mprotect.c: avoid repeated assignment in do_mprotect_pkey()
After adjustment, the repeated assignment of "prev" is avoided, and the
readability of the code is improved.
Link: https://lkml.kernel.org/r/20211012152444.4127-1-fishland@aliyun.com Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Liu Song <liu.song11@zte.com.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Lukas Bulwahn [Fri, 5 Nov 2021 20:39:00 +0000 (13:39 -0700)]
memory: remove unused CONFIG_MEM_BLOCK_SIZE
Commit f4139dd75369 ("[PATCH] memory hotplug: sysfs and add/remove
functions") defines CONFIG_MEM_BLOCK_SIZE, but this has never been
utilized anywhere.
It is a good practice to keep the CONFIG_* defines exclusively for the
Kbuild system. So, drop this unused definition.
This issue was noticed due to running ./scripts/checkkconfigsymbols.py.
Link: https://lkml.kernel.org/r/20211006120354.7468-1-lukas.bulwahn@gmail.com Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Documentation: update pagemap with shmem exceptions
This patch follows the discussions on previous documentation patch
threads [1][2]. It presents the exception case of shared memory
management from the pagemap's point of view. It briefly describes what
is missing, why it is missing and alternatives to the pagemap for page
info retrieval in user space.
In short, the kernel does not keep track of PTEs for swapped out shared
pages within the processes that references them. Thus, the
proc/pid/pagemap tool cannot print the swap destination of the shared
memory pages, instead setting the pagemap entry to zero for both
non-allocated and swapped out pages. This can create confusion for
users who need information on swapped out pages.
The reasons why maintaining the PTEs of all swapped out shared pages
among all processes while maintaining similar performance is not a
trivial task, or a desirable change, have been discussed extensively
[1][3][4][5]. There are also arguments for why this arguably missing
information should eventually be exposed to the user in either a future
pagemap patch, or by an alternative tool.
Qi Zheng [Fri, 5 Nov 2021 20:38:41 +0000 (13:38 -0700)]
mm: remove redundant smp_wmb()
The smp_wmb() which is in the __pte_alloc() is used to ensure all ptes
setup is visible before the pte is made visible to other CPUs by being
put into page tables. We only need this when the pte is actually
populated, so move it to pmd_install(). __pte_alloc_kernel(),
__p4d_alloc(), __pud_alloc() and __pmd_alloc() are similar to this case.
We can also defer smp_wmb() to the place where the pmd entry is really
populated by preallocated pte. There are two kinds of user of
preallocated pte, one is filemap & finish_fault(), another is THP. The
former does not need another smp_wmb() because the smp_wmb() has been
done by pmd_install(). Fortunately, the latter also does not need
another smp_wmb() because there is already a smp_wmb() before populating
the new pte when the THP uses a preallocated pte to split a huge pmd.
Link: https://lkml.kernel.org/r/20210901102722.47686-3-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Mika Penttila <mika.penttila@nextfour.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 5 Nov 2021 20:38:34 +0000 (13:38 -0700)]
mm: add zap_skip_check_mapping() helper
Use the helper for the checks. Rename "check_mapping" into
"zap_mapping" because "check_mapping" looks like a bool but in fact it
stores the mapping itself. When it's set, we check the mapping (it must
be non-NULL). When it's cleared we skip the check, which works like the
old way.
Move the duplicated comments to the helper too.
Link: https://lkml.kernel.org/r/20210915181538.11288-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 5 Nov 2021 20:38:31 +0000 (13:38 -0700)]
mm: drop first_index/last_index in zap_details
The first_index/last_index parameters in zap_details are actually only
used in unmap_mapping_range_tree(). At the meantime, this function is
only called by unmap_mapping_pages() once.
Instead of passing these two variables through the whole stack of page
zapping code, remove them from zap_details and let them simply be
parameters of unmap_mapping_range_tree(), which is inlined.
Link: https://lkml.kernel.org/r/20210915181535.11238-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Alistair Popple <apopple@nvidia.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam Howlett <liam.howlett@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 5 Nov 2021 20:38:28 +0000 (13:38 -0700)]
mm: clear vmf->pte after pte_unmap_same() returns
pte_unmap_same() will always unmap the pte pointer. After the unmap,
vmf->pte will not be valid any more, we should clear it.
It was safe only because no one is accessing vmf->pte after
pte_unmap_same() returns, since the only caller of pte_unmap_same() (so
far) is do_swap_page(), where vmf->pte will in most cases be overwritten
very soon.
Directly pass in vmf into pte_unmap_same() and then we can also avoid
the long parameter list too, which should be a nice cleanup.
Link: https://lkml.kernel.org/r/20210915181533.11188-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Liam Howlett <liam.howlett@oracle.com> Acked-by: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Jerome Glisse <jglisse@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 5 Nov 2021 20:38:24 +0000 (13:38 -0700)]
mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte
Patch series "mm: A few cleanup patches around zap, shmem and uffd", v4.
IMHO all of them are very nice cleanups to existing code already,
they're all small and self-contained. They'll be needed by uffd-wp
coming series.
This patch (of 4):
It was conditionally done previously, as there's one shmem special case
that we use SetPageDirty() instead. However that's not necessary and it
should be easier and cleaner to do it unconditionally in
mfill_atomic_install_pte().
The most recent discussion about this is here, where Hugh explained the
history of SetPageDirty() and why it's possible that it's not required
at all:
After the change: case (1) should have its SetPageDirty replaced by the
dirty bit on pte (so we unify them together, finally), case (2) should
have no functional change at all as it has page_in_cache==false, case
(3) may add a dirty bit to the pte. However since case (3) is
UFFDIO_CONTINUE for shmem, it's merely 100% sure the page is dirty after
all because UFFDIO_CONTINUE normally requires another process to modify
the page cache and kick the faulted thread, so should not make a real
difference either.
This should make it much easier to follow on which case will set dirty
for uffd, as we'll simply set it all now for all uffd related ioctls.
Meanwhile, no special handling of SetPageDirty() if there's no need.
Link: https://lkml.kernel.org/r/20210915181456.10739-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20210915181456.10739-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Liam Howlett <liam.howlett@oracle.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Yang Shi <shy828301@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Jerome Glisse <jglisse@redhat.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Annotating a pointer from __user to kernel and then back again might
confuse sparse. In copy_huge_page_from_user() it can be avoided by
removing the intermediate variable since it is never used.
Link: https://lkml.kernel.org/r/20210914150820.19326-1-amit.kachhap@arm.com Signed-off-by: Amit Daniel Kachhap <amit.kachhap@arm.com> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Vincenzo Frascino <Vincenzo.Frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peng Liu [Fri, 5 Nov 2021 20:38:12 +0000 (13:38 -0700)]
mm/mmap.c: fix a data race of mm->total_vm
The variable mm->total_vm could be accessed concurrently during mmaping
and system accounting as noticed by KCSAN,
BUG: KCSAN: data-race in __acct_update_integrals / mmap_region
read-write to 0xffffa40267bd14c8 of 8 bytes by task 15609 on cpu 3:
mmap_region+0x6dc/0x1400
do_mmap+0x794/0xca0
vm_mmap_pgoff+0xdf/0x150
ksys_mmap_pgoff+0xe1/0x380
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
read to 0xffffa40267bd14c8 of 8 bytes by interrupt on cpu 2:
__acct_update_integrals+0x187/0x1d0
acct_account_cputime+0x3c/0x40
update_process_times+0x5c/0x150
tick_sched_timer+0x184/0x210
__run_hrtimer+0x119/0x3b0
hrtimer_interrupt+0x350/0xaa0
__sysvec_apic_timer_interrupt+0x7b/0x220
asm_call_irq_on_stack+0x12/0x20
sysvec_apic_timer_interrupt+0x4d/0x80
asm_sysvec_apic_timer_interrupt+0x12/0x20
smp_call_function_single+0x192/0x2b0
perf_install_in_context+0x29b/0x4a0
__se_sys_perf_event_open+0x1a98/0x2550
__x64_sys_perf_event_open+0x63/0x70
do_syscall_64+0x37/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reported by Kernel Concurrency Sanitizer on:
CPU: 2 PID: 15610 Comm: syz-executor.3 Not tainted 5.10.0+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
Ubuntu-1.8.2-1ubuntu1 04/01/2014
In vm_stat_account which called by mmap_region, increase total_vm, and
__acct_update_integrals may read total_vm at the same time. This will
cause a data race which lead to undefined behaviour. To avoid potential
bad read/write, volatile property and barrier are both used to avoid
undefined behaviour.
Vasily Averin [Fri, 5 Nov 2021 20:38:09 +0000 (13:38 -0700)]
memcg: prohibit unconditional exceeding the limit of dying tasks
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It is assumed that the amount of the memory charged by those
tasks is bound and most of the memory will get released while the task
is exiting. This is resembling a heuristic for the global OOM situation
when tasks get access to memory reserves. There is no global memory
shortage at the memcg level so the memcg heuristic is more relieved.
The above assumption is overly optimistic though. E.g. vmalloc can
scale to really large requests and the heuristic would allow that. We
used to have an early break in the vmalloc allocator for killed tasks
but this has been reverted by commit 0d9c1c2c0095 ("Revert "vmalloc:
back off when the current task is killed""). There are likely other
similar code paths which do not check for fatal signals in an
allocation&charge loop. Also there are some kernel objects charged to a
memcg which are not bound to a process life time.
It has been observed that it is not really hard to trigger these
bypasses and cause global OOM situation.
One potential way to address these runaways would be to limit the amount
of excess (similar to the global OOM with limited oom reserves). This
is certainly possible but it is not really clear how much of an excess
is desirable and still protects from global OOMs as that would have to
consider the overall memcg configuration.
This patch is addressing the problem by removing the heuristic
altogether. Bypass is only allowed for requests which either cannot
fail or where the failure is not desirable while excess should be still
limited (e.g. atomic requests). Implementation wise a killed or dying
task fails to charge if it has passed the OOM killer stage. That should
give all forms of reclaim chance to restore the limit before the failure
(ENOMEM) and tell the caller to back off.
In addition, this patch renames should_force_charge() helper to
task_is_dying() because now its use is not associated witch forced
charging.
This patch depends on pagefault_out_of_memory() to not trigger
out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
and cause a global OOM killer.
Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Shakeel Butt <shakeelb@google.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Michal Hocko [Fri, 5 Nov 2021 20:38:06 +0000 (13:38 -0700)]
mm, oom: do not trigger out_of_memory from the #PF
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory. This can happen for 2
different reasons. a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.
The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations. Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path. Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.
Moreover if the allocation failed and the OOM killer hasn't been invoked
then we are unlikely to do the right thing from the #PF context because
we have already lost the allocation context and restictions and
therefore might oom kill a task from a different NUMA domain.
This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it. Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.
[VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into a
real disaster. This has been broken since kmem accounting has been
introduced for cgroup v1 (3.8). There was no kmem specific reclaim for
the separate limit so the only way to handle kmem hard limit was to return
with ENOMEM. In upstream the problem will be fixed by removing the
outdated kmem limit, however stable and LTS kernels cannot do it and are
still affected. This patch fixes the problem and should be backported
into stable/LTS.]
Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com Signed-off-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Vasily Averin [Fri, 5 Nov 2021 20:38:02 +0000 (13:38 -0700)]
mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks
Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit. It can be misused and allowed to trigger global OOM from inside
a memcg-limited container. On the other hand if memcg fails allocation,
called from inside #PF handler it triggers global OOM from inside
pagefault_out_of_memory().
To prevent these problems this patchset:
(a) removes execution of out_of_memory() from
pagefault_out_of_memory(), becasue nobody can explain why it is
necessary.
(b) allow memcg to fail allocation of dying/killed tasks.
This patch (of 3):
Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory which in turn executes
out_out_memory() and can kill a random task.
An allocation might fail when the current task is the oom victim and
there are no memory reserves left. The OOM killer is already handled at
the page allocator level for the global OOM and at the charging level
for the memcg one. Both have much more information about the scope of
allocation/charge request. This means that either the OOM killer has
been invoked properly and didn't lead to the allocation success or it
has been skipped because it couldn't have been invoked. In both cases
triggering it from here is pointless and even harmful.
It makes much more sense to let the killed task die rather than to wake
up an eternally hungry oom-killer and send him to choose a fatter victim
for breakfast.
Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 5 Nov 2021 20:37:59 +0000 (13:37 -0700)]
mm: list_lru: only add memcg-aware lrus to the global lru list
The non-memcg-aware lru is always skiped when traversing the global lru
list, which is not efficient. We can only add the memcg-aware lru to
the global lru list instead to make traversing more efficient.
Link: https://lkml.kernel.org/r/20211025124353.55781-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 5 Nov 2021 20:37:56 +0000 (13:37 -0700)]
mm: memcontrol: remove the kmem states
Now the kmem states is only used to indicate whether the kmem is
offline. However, we can set ->kmemcg_id to -1 to indicate whether the
kmem is offline. Finally, we can remove the kmem states to simplify the
code.
Link: https://lkml.kernel.org/r/20211025125259.56624-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 5 Nov 2021 20:37:53 +0000 (13:37 -0700)]
mm: memcontrol: remove kmemcg_id reparenting
Since slab objects and kmem pages are charged to object cgroup instead
of memory cgroup, memcg_reparent_objcgs() will reparent this cgroup and
all its descendants to its parent cgroup. This already makes further
list_lru_add()'s add elements to the parent's list. So it is
unnecessary to change kmemcg_id of an offline cgroup to its parent's id.
It just wastes CPU cycles. Just remove the redundant code.
Link: https://lkml.kernel.org/r/20211025125102.56533-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Acked-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 5 Nov 2021 20:37:50 +0000 (13:37 -0700)]
mm: list_lru: fix the return value of list_lru_count_one()
Since commit 47fe02897112 ("memcg: reparent list_lrus and free kmemcg_id
on css offline"), ->nr_items can be negative during memory cgroup
reparenting. In this case, list_lru_count_one() will return an unusual
and huge value, which can surprise users. At least for now it hasn't
affected any users. But it is better to let list_lru_count_ont()
returns zero when ->nr_items is negative.
Link: https://lkml.kernel.org/r/20211025124910.56433-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Muchun Song [Fri, 5 Nov 2021 20:37:47 +0000 (13:37 -0700)]
mm: list_lru: remove holding lru lock
Since commit c6746a97c9b8 ("rcu: Consolidate PREEMPT and !PREEMPT
synchronize_rcu()"), the critical section of spin lock can serve as an
RCU read-side critical section which already allows readers that hold
nlru->lock to avoid taking rcu lock. So just remove holding lock.
Link: https://lkml.kernel.org/r/20211025124534.56345-1-songmuchun@bytedance.com Signed-off-by: Muchun Song <songmuchun@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Roman Gushchin <guro@fb.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Fri, 5 Nov 2021 20:37:44 +0000 (13:37 -0700)]
memcg, kmem: further deprecate kmem.limit_in_bytes
The deprecation process of kmem.limit_in_bytes started with the commit 825aaeaca17 ("memcg, kmem: deprecate kmem.limit_in_bytes") which also
explains in detail the motivation behind the deprecation. To summarize,
it is the unexpected behavior on hitting the kmem limit. This patch
moves the deprecation process to the next stage by disallowing to set
the kmem limit. In future we might just remove the kmem.limit_in_bytes
file completely.
[akpm@linux-foundation.org: s/ENOTSUPP/EOPNOTSUPP/]
[arnd@arndb.de: mark cancel_charge() inline] Link: https://lkml.kernel.org/r/20211022070542.679839-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20211019153408.2916808-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Cc: Vasily Averin <vvs@virtuozzo.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Len Baker [Fri, 5 Nov 2021 20:37:40 +0000 (13:37 -0700)]
mm/list_lru.c: prefer struct_size over open coded arithmetic
As noted in the "Deprecated Interfaces, Language Features, Attributes,
and Conventions" documentation [1], size calculations (especially
multiplication) should not be performed in memory allocator (or similar)
function arguments due to the risk of them overflowing.
This could lead to values wrapping around and a smaller allocation being
made than the caller was expecting. Using those allocations could lead
to linear overflows of heap memory and other misbehaviors.
So, use the struct_size() helper to do the arithmetic instead of the
argument "size + count * size" in the kvmalloc() functions.
Also, take the opportunity to refactor the memcpy() call to use the
flex_array_size() helper.
This code was detected with the help of Coccinelle and audited and fixed
manually.
Waiman Long [Fri, 5 Nov 2021 20:37:37 +0000 (13:37 -0700)]
mm/memcg: remove obsolete memcg_free_kmem()
Since commit f3243f4916d4 ("mm: kmem: make memcg_kmem_enabled()
irreversible"), the only thing memcg_free_kmem() does is to call
memcg_offline_kmem() when the memcg is still online which can happen
when online_css() fails due to -ENOMEM.
However, the name memcg_free_kmem() is confusing and it is more clear
and straight forward to call memcg_offline_kmem() directly from
mem_cgroup_css_free().
Link: https://lkml.kernel.org/r/20211005202450.11775-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Roman Gushchin <guro@fb.com> Reviewed-by: Aaron Tomlin <atomlin@redhat.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Roman Gushchin <guro@fb.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Fri, 5 Nov 2021 20:37:34 +0000 (13:37 -0700)]
memcg: unify memcg stat flushing
The memcg stats can be flushed in multiple context and potentially in
parallel too. For example multiple parallel user space readers for
memcg stats will contend on the rstat locks with each other. There is
no need for that. We just need one flusher and everyone else can
benefit.
In addition after 0ea3bb5633be ("memcg: infrastructure to flush memcg
stats") the kernel periodically flush the memcg stats from the root, so,
the other flushers will potentially have much less work to do.
Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Fri, 5 Nov 2021 20:37:31 +0000 (13:37 -0700)]
memcg: flush stats only if updated
At the moment, the kernel flushes the memcg stats on every refault and
also on every reclaim iteration. Although rstat maintains per-cpu
update tree but on the flush the kernel still has to go through all the
cpu rstat update tree to check if there is anything to flush. This
patch adds the tracking on the stats update side to make flush side more
clever by skipping the flush if there is no update.
The stats update codepath is very sensitive performance wise for many
workloads and benchmarks. So, we can not follow what the commit 0ea3bb5633be ("memcg: infrastructure to flush memcg stats") did which
was triggering async flush through queue_work() and caused a lot
performance regression reports. That got reverted by the commit 4cd61058cf95 ("memcg: flush lruvec stats in the refault").
In this patch we kept the stats update codepath very minimal and let the
stats reader side to flush the stats only when the updates are over a
specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).
To evaluate the impact of this patch, an 8 GiB tmpfs file is created on
a system with swap-on-zram and the file was pushed to swap through
memory.force_empty interface. On reading the whole file, the memcg stat
flush in the refault code path is triggered. With this patch, we
observed 63% reduction in the read time of 8 GiB file.
Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Reviewed-by: "Michal Koutný" <mkoutny@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Peter Xu [Fri, 5 Nov 2021 20:37:28 +0000 (13:37 -0700)]
mm/memcg: drop swp_entry_t* in mc_handle_file_pte()
It is unused after the rework of commit bcd80a94eb13 ("mm: use
find_get_incore_page in memcontrol").
Link: https://lkml.kernel.org/r/20210916193014.80129-1-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of calling put_page() one page at a time, pop pages off the list
if their refcount was too high and pass the remainder to
put_unref_page_list(). This should be a speed improvement, but I have
no measurements to support that. Current callers do not care about
performance, but I hope to add some which do.
Link: https://lkml.kernel.org/r/20211007192138.561673-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rafael Aquini [Fri, 5 Nov 2021 20:37:22 +0000 (13:37 -0700)]
mm/swapfile: fix an integer overflow in swap_show()
This one is just a minor nuisance for people going through /proc/swaps
if any of their swapareas is bigger than, or equal to 1073741824 pages
(4TB).
seq_printf() format string casts as uint the conversion from pages to
KB, and that will overflow in the aforementioned case.
Albeit being almost unthinkable that someone would actually set up such
big of a single swaparea, there is a ticket recently filed against RHEL:
https://bugzilla.redhat.com/show_bug.cgi?id=2008812
Given that all other codesites that use format strings for the same swap
pages-to-KB conversion do cast it as ulong, this patch just follows
suit.
Link: https://lkml.kernel.org/r/20211006184011.2579054-1-aquini@redhat.com Signed-off-by: Rafael Aquini <aquini@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
John Hubbard [Fri, 5 Nov 2021 20:37:16 +0000 (13:37 -0700)]
mm/gup: further simplify __gup_device_huge()
Commit a910743da56c ("mm: gup: fix potential pgmap refcnt leak in
__gup_device_huge()") simplified the return paths, but didn't go quite
far enough, as discussed in [1].
Remove the "ret" variable entirely, because there is enough information
already available to provide the return value.
Jens Axboe [Fri, 5 Nov 2021 20:37:13 +0000 (13:37 -0700)]
mm: move more expensive part of XA setup out of mapping check
The fast path here is not needing any writeback, yet we spend time
setting up the xarray lookup data upfront. Move the part that actually
needs to iterate the address space mapping into a separate helper,
saving ~30% of the time here.
It is not safe to check page->index without holding the page lock. It
can be changed if the page is moved between the swap cache and the page
cache for a shmem file, for example. There is a VM_BUG_ON below which
checks page->index is correct after taking the page lock.
Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org Fixes: ce869ae5cf28 ("mm: add and use find_lock_entries") Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reported-by: <syzbot+c87be4f669d920c76330@syzkaller.appspotmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jens Axboe [Fri, 5 Nov 2021 20:37:07 +0000 (13:37 -0700)]
mm: don't read i_size of inode unless we need it
We always go through i_size_read(), and we rarely end up needing it.
Push the read to down where we need to check it, which avoids it for
most cases.
It looks like we can even remove this check entirely, which might be
worth pursuing. But at least this takes it out of the hot path.
Link: https://lkml.kernel.org/r/6b67981f-57d4-c80e-bc07-6020aa601381@kernel.dk Signed-off-by: Jens Axboe <axboe@kernel.dk> Acked-by: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Move grabbing and releasing the bdi refcount out of the common
wb_init/wb_exit helpers into code that is only used for the non-default
memcg driven bdi_writeback structures.
David Howells [Fri, 5 Nov 2021 20:36:49 +0000 (13:36 -0700)]
mm: stop filemap_read() from grabbing a superfluous page
Under some circumstances, filemap_read() will allocate sufficient pages
to read to the end of the file, call readahead/readpages on them and
copy the data over - and then it will allocate another page at the EOF
and call readpage on that and then ignore it. This is unnecessary and a
waste of time and resources.
filemap_read() *does* check for this, but only after it has already done
the allocation and I/O. Fix this by checking before calling
filemap_get_pages() also.
Kees Cook [Fri, 5 Nov 2021 20:36:42 +0000 (13:36 -0700)]
percpu: add __alloc_size attributes for better bounds checking
As already done in GrapheneOS, add the __alloc_size attribute for
appropriate percpu allocator interfaces, to provide additional hinting
for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.
Note that due to the implementation of the percpu API, this is unlikely
to ever actually provide compile-time checking beyond very simple
non-SMP builds. But, since they are technically allocators, mark them
as such.
Link: https://lkml.kernel.org/r/20210930222704.2631604-9-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Daniel Micay <danielmicay@gmail.com> Acked-by: Dennis Zhou <dennis@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Christoph Lameter <cl@linux.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: David Rientjes <rientjes@google.com> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 5 Nov 2021 20:36:38 +0000 (13:36 -0700)]
mm/page_alloc: add __alloc_size attributes for better bounds checking
As already done in GrapheneOS, add the __alloc_size attribute for
appropriate page allocator interfaces, to provide additional hinting for
better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.
Link: https://lkml.kernel.org/r/20210930222704.2631604-8-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Daniel Micay <danielmicay@gmail.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 5 Nov 2021 20:36:34 +0000 (13:36 -0700)]
mm/vmalloc: add __alloc_size attributes for better bounds checking
As already done in GrapheneOS, add the __alloc_size attribute for
appropriate vmalloc allocator interfaces, to provide additional hinting
for better bounds checking, assisting CONFIG_FORTIFY_SOURCE and other
compiler optimizations.
Link: https://lkml.kernel.org/r/20210930222704.2631604-7-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Daniel Micay <danielmicay@gmail.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kees Cook [Fri, 5 Nov 2021 20:36:31 +0000 (13:36 -0700)]
mm/kvmalloc: add __alloc_size attributes for better bounds checking
As already done in GrapheneOS, add the __alloc_size attribute for
regular kvmalloc interfaces, to provide additional hinting for better
bounds checking, assisting CONFIG_FORTIFY_SOURCE and other compiler
optimizations.
Link: https://lkml.kernel.org/r/20210930222704.2631604-6-keescook@chromium.org Signed-off-by: Kees Cook <keescook@chromium.org> Co-developed-by: Daniel Micay <danielmicay@gmail.com> Signed-off-by: Daniel Micay <danielmicay@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Whitcroft <apw@canonical.com> Cc: Dennis Zhou <dennis@kernel.org> Cc: Dwaipayan Ray <dwaipayanray1@gmail.com> Cc: Joe Perches <joe@perches.com> Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Tejun Heo <tj@kernel.org> Cc: Alexandre Bounine <alex.bou9@gmail.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Ira Weiny <ira.weiny@intel.com> Cc: Jing Xiangfeng <jingxiangfeng@huawei.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: kernel test robot <lkp@intel.com> Cc: Matt Porter <mporter@kernel.crashing.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Souptick Joarder <jrdr.linux@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>