Aaron Lu [Thu, 5 Apr 2018 23:24:06 +0000 (16:24 -0700)]
mm/free_pcppages_bulk: update pcp->count inside
Matthew Wilcox found that all callers of free_pcppages_bulk() currently
update pcp->count immediately after so it's natural to do it inside
free_pcppages_bulk().
No functionality or performance change is expected from this patch.
Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com Signed-off-by: Aaron Lu <aaron.lu@intel.com> Suggested-by: Matthew Wilcox <willy@infradead.org> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Huang Ying <ying.huang@intel.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Kemi Wang <kemi.wang@intel.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 5 Apr 2018 23:24:02 +0000 (16:24 -0700)]
mm, compaction: drain pcps for zone when kcompactd fails
It's possible for free pages to become stranded on per-cpu pagesets
(pcps) that, if drained, could be merged with buddy pages on the zone's
free area to form large order pages, including up to MAX_ORDER.
Consider a verbose example using the tools/vm/page-types tool at the
beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
a slab page). Pages on pcps do not have any page flags set.
The compaction migration scanner is attempting to defragment this memory
since it is at the beginning of the zone. It has done so quite well,
all movable pages have been migrated. From pfn [0x109955, 0x109fab),
there are only buddy pages and pages without flags set.
These pages may be stranded on pcps that could otherwise allow this
memory to be coalesced if freed back to the zone free area. It is
possible that some of these pages may not be on pcps and that something
has called alloc_pages() and used the memory directly, but we rely on
the absence of __GFP_MOVABLE in these cases to allocate from
MIGATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
pageblocks as free as possible.
These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
allow for three transparent hugepages to be dynamically allocated.
Running the numbers for all such spans on the system, it was found that
there were over 400 such spans of only buddy pages and pages without
flags set at the time this /proc/kpageflags sample was collected.
Without this support, there were _no_ order-9 or order-10 pages free.
When kcompactd fails to defragment memory such that a cc.order page can
be allocated, drain all pcps for the zone back to the buddy allocator so
this stranding cannot occur. Compaction for that order will
subsequently be deferred, which acts as a ratelimit on this drain.
Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com Signed-off-by: David Rientjes <rientjes@google.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: make should_failslab always available for fault injection
should_failslab() is a convenient function to hook into for directed
error injection into kmalloc(). However, it is only available if a
config flag is set.
The following BCC script, for example, fails kmalloc() calls after a
btrfs umount:
int kprobe__should_failslab(struct pt_regs *ctx) {
u64 key = 1;
u64 *res;
res = flag.lookup(&key);
if (res != 0) {
bpf_override_return(ctx, -ENOMEM);
}
return 0;
}
"""
b = BPF(text=prog)
while 1:
b.kprobe_poll()
This patch refactors the should_failslab implementation so that the
function is always available for error injection, independent of flags.
This change would be similar in nature to commit f5490d3ec921 ("block:
Add should_fail_bio() for bpf error injection").
Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com Signed-off-by: Howard McLauchlan <hmclauchlan@fb.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Johannes Weiner <jweiner@fb.com> Cc: Alexei Starovoitov <ast@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Minchan Kim [Thu, 5 Apr 2018 23:23:42 +0000 (16:23 -0700)]
mm: swap: unify cluster-based and vma-based swap readahead
This patch makes do_swap_page() not need to be aware of two different
swap readahead algorithms. Just unify cluster-based and vma-based
readahead function call.
Minchan Kim [Thu, 5 Apr 2018 23:23:39 +0000 (16:23 -0700)]
mm: swap: clean up swap readahead
When I see recent change of swap readahead, I am very unhappy about
current code structure which diverges two swap readahead algorithm in
do_swap_page. This patch is to clean it up.
Main motivation is that fault handler doesn't need to be aware of
readahead algorithms but just should call swapin_readahead.
As first step, this patch cleans up a little bit but not perfect (I just
separate for review easier) so next patch will make the goal complete.
mm,vmscan: don't pretend forward progress upon shrinker_rwsem contention
Since we no longer use return value of shrink_slab() for normal reclaim,
the comment is no longer true. If some do_shrink_slab() call takes
unexpectedly long (root cause of stall is currently unknown) when
register_shrinker()/unregister_shrinker() is pending, trying to drop
caches via /proc/sys/vm/drop_caches could become infinite cond_resched()
loop if many mem_cgroup are defined. For safety, let's not pretend
forward progress.
Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Dave Chinner <dchinner@redhat.com> Cc: Glauber Costa <glommer@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently if z3fold couldn't find an unbuddied page it would first try
to pull a page off the stale list. The problem with this approach is
that we can't 100% guarantee that the page is not processed by the
workqueue thread at the same time unless we run cancel_work_sync() on
it, which we can't do if we're in an atomic context. So let's just
limit stale list usage to non-atomic contexts only.
Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com Signed-off-by: Vitaly Vul <vitaly.vul@sony.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: <Oleksiy.Avramchenko@sony.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/huge_memory.c: reorder operations in __split_huge_page_tail()
THP split makes non-atomic change of tail page flags. This is almost ok
because tail pages are locked and isolated but this breaks recent
changes in page locking: non-atomic operation could clear bit
PG_waiters.
As a result concurrent sequence get_page_unless_zero() -> lock_page()
might block forever. Especially if this page was truncated later.
Fix is trivial: clone flags before unfreezing page reference counter.
This race exists since commit 0483375c55d8 ("mm: add PageWaiters
indicating tasks are waiting for a page bit") while unsave unfreeze
itself was added in commit 1b96ac7e66f9 ("thp: cleanup
split_huge_page()").
clear_compound_head() also must be called before unfreezing page
reference because after successful get_page_unless_zero() might follow
put_page() which needs correct compound_head().
And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze() which
is made especially for that and has semantic of smp_store_release().
Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm: fix races between address_space dereference and free in page_evicatable
When page_mapping() is called and the mapping is dereferenced in
page_evicatable() through shrink_active_list(), it is possible for the
inode to be truncated and the embedded address space to be freed at the
same time. This may lead to the following race.
CPU1 CPU2
truncate(inode) shrink_active_list()
... page_evictable(page)
truncate_inode_page(mapping, page);
delete_from_page_cache(page)
spin_lock_irqsave(&mapping->tree_lock, flags);
__delete_from_page_cache(page, NULL)
page_cache_tree_delete(..)
... mapping = page_mapping(page);
page->mapping = NULL;
...
spin_unlock_irqrestore(&mapping->tree_lock, flags);
page_cache_free_page(mapping, page)
put_page(page)
if (put_page_testzero(page)) -> false
- inode now has no pages and can be freed including embedded address_space
mapping_unevictable(mapping)
test_bit(AS_UNEVICTABLE, &mapping->flags);
- we've dereferenced mapping which is potentially already free.
Similar race exists between swap cache freeing and page_evicatable()
too.
The address_space in inode and swap cache will be freed after a RCU
grace period. So the races are fixed via enclosing the page_mapping()
and address_space usage in rcu_read_lock/unlock(). Some comments are
added in code to make it clear what is protected by the RCU read lock.
Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com Signed-off-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Minchan Kim <minchan@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Rientjes [Thu, 5 Apr 2018 23:23:09 +0000 (16:23 -0700)]
mm, page_alloc: extend kernelcore and movablecore for percent
Both kernelcore= and movablecore= can be used to define the amount of
ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
the system memory capacity to be known when specifying the command line,
however.
This introduces the ability to define both kernelcore= and movablecore=
as a percentage of total system memory. This is convenient for systems
software that wants to define the amount of ZONE_MOVABLE, for example,
as a proportion of a system's memory rather than a hardcoded byte value.
To define the percentage, the final character of the parameter should be
a '%'.
mhocko: "why is anyone using these options nowadays?"
rientjes:
:
: Fragmentation of non-__GFP_MOVABLE pages due to low on memory
: situations can pollute most pageblocks on the system, as much as 1GB of
: slab being fragmented over 128GB of memory, for example. When the
: amount of kernel memory is well bounded for certain systems, it is
: better to aggressively reclaim from existing MIGRATE_UNMOVABLE
: pageblocks rather than eagerly fallback to others.
:
: We have additional patches that help with this fragmentation if you're
: interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
: pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
: draining of pcp lists back to the zone free area to prevent stranding.
mm: hwpoison: disable memory error handling on 1GB hugepage
Recently the following BUG was reported:
Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
Memory failure: 0x3c0000: recovery action for huge page: Recovered
BUG: unable to handle kernel paging request at ffff8dfcc0003000
IP: gup_pgd_range+0x1f0/0xc20
PGD 17ae72067 P4D 17ae72067 PUD 0
Oops: 0000 [#1] SMP PTI
...
CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014
You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
a 1GB hugepage. This happens because get_user_pages_fast() is not aware
of a migration entry on pud that was created in the 1st madvise() event.
I think that conversion to pud-aligned migration entry is working, but
other MM code walking over page table isn't prepared for it. We need
some time and effort to make all this work properly, so this patch
avoids the reported bug by just disabling error handling for 1GB
hugepage.
[n-horiguchi@ah.jp.nec.com: v2] Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Punit Agrawal <punit.agrawal@arm.com> Tested-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 5 Apr 2018 23:23:00 +0000 (16:23 -0700)]
mm/memory_hotplug: optimize memory hotplug
During memory hotplugging we traverse struct pages three times:
1. memset(0) in sparse_add_one_section()
2. loop in __add_section() to set do: set_page_node(page, nid); and
SetPageReserved(page);
3. loop in memmap_init_zone() to call __init_single_pfn()
This patch removes the first two loops, and leaves only loop 3. All
struct pages are initialized in one place, the same as it is done during
boot.
The benefits:
- We improve memory hotplug performance because we are not evicting the
cache several times and also reduce loop branching overhead.
- Remove condition from hotpath in __init_single_pfn(), that was added
in order to fix the problem that was reported by Bharata in the above
email thread, thus also improve performance during normal boot.
- Make memory hotplug more similar to the boot memory initialization
path because we zero and initialize struct pages only in one
function.
- Simplifies memory hotplug struct page initialization code, and thus
enables future improvements, such as multi-threading the
initialization of struct pages in order to improve hotplug
performance even further on larger machines.
[pasha.tatashin@oracle.com: v5] Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 5 Apr 2018 23:22:56 +0000 (16:22 -0700)]
mm/memory_hotplug: don't read nid from struct page during hotplug
During memory hotplugging the probe routine will leave struct pages
uninitialized, the same as it is currently done during boot. Therefore,
we do not want to access the inside of struct pages before
__init_single_page() is called during onlining.
Because during hotplug we know that pages in one memory block belong to
the same numa node, we can skip the checking. We should keep checking
for the boot case.
[pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()] Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 5 Apr 2018 23:22:52 +0000 (16:22 -0700)]
mm/memory_hotplug: optimize probe routine
When memory is hotplugged pages_correctly_reserved() is called to verify
that the added memory is present, this routine traverses through every
struct page and verifies that PageReserved() is set. This is a slow
operation especially if a large amount of memory is added.
Instead of checking every page, it is enough to simply check that the
section is present, has mapping (struct page array is allocated), and
the mapping is online.
In addition, we should not excpect that probe routine sets flags in
struct page, as the struct pages have not yet been initialized. The
initialization should be done in __init_single_page(), the same as
during boot.
Link: http://lkml.kernel.org/r/20180215165920.8570-5-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
During boot we poison struct page memory in order to ensure that no one
is accessing this memory until the struct pages are initialized in
__init_single_page().
This patch adds more scrutiny to this checking by making sure that flags
do not equal the poison pattern when they are accessed. The pattern is
all ones.
Since node id is also stored in struct page, and may be accessed quite
early, we add this enforcement into page_to_nid() function as well.
Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n
[pasha.tatashin@oracle.com: v4] Link: http://lkml.kernel.org/r/20180215165920.8570-4-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180213193159.14606-4-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Ingo Molnar <mingo@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Baoquan He <bhe@redhat.com> Cc: Bharata B Rao <bharata@linux.vnet.ibm.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 5 Apr 2018 23:22:43 +0000 (16:22 -0700)]
x86/mm/memory_hotplug: determine block size based on the end of boot memory
Memory sections are combined into "memory block" chunks. These chunks
are the units upon which memory can be added and removed.
On x86, the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hot-plugging/hot-removing can be broken.
Memory sections are combined into "memory block" chunks. These chunks
are the units upon which memory can be added and removed.
On x86 the new memory may be added after the end of the boot memory,
therefore, if block size does not align with end of boot memory, memory
hotplugging/hotremoving can be broken.
Currently, whenever machine is booted with more than 64G the block size
is unconditionally increased to 2G from the base 128M. This is done in
order to reduce number of memory device files in sysfs:
/sys/devices/system/memory/memoryXXX
We must use the largest allowed block size that aligns to the next
address to be able to hotplug the next block of memory.
So, when memory is larger or equal to 64G, we check the end address and
find the largest block size that is still power of two but smaller or
equal to 2G.
Before, the fix:
Run qemu with:
-m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G
static bool pages_correctly_reserved(unsigned long start_pfn)
205 if (WARN_ON_ONCE(!pfn_valid(pfn)))
This function loops through every section in the newly added memory
block and verifies that the first pfn is valid, meaning section exists,
has mapping (struct page array), and is online.
The block size on x86 is usually 128M, but when machine is booted with
more than 64G of memory, the block size is changed to 2G: $ cat
/sys/devices/system/memory/block_size_bytes 80000000
During memory hotplug, and hotremove we verify that the range is section
size aligned, but we actually must verify that it is block size aligned,
because that is the proper unit for hotplug operations. See:
Documentation/memory-hotplug.txt
So, when the start_pfn of newly added memory is not block size aligned,
we can get a memory block that has only part of it with properly
populated sections.
In our case the start_pfn starts from the last_pfn (end of physical
memory).
Yang Shi [Thu, 5 Apr 2018 23:22:35 +0000 (16:22 -0700)]
mm: thp: fix potential clearing to referenced flag in page_idle_clear_pte_refs_one()
For PTE-mapped THP, the compound THP has not been split to normal 4K
pages yet, the whole THP is considered referenced if any one of sub page
is referenced.
When walking PTE-mapped THP by pvmw, all relevant PTEs will be checked
to retrieve referenced bit. But, the current code just returns the
result of the last PTE. If the last PTE has not referenced, the
referenced flag will be cleared.
Just set referenced when ptep{pmdp}_clear_young_notify() returns true.
Link: http://lkml.kernel.org/r/1518212451-87134-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reported-by: Gang Deng <gavin.dg@linux.alibaba.com> Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 5 Apr 2018 23:22:31 +0000 (16:22 -0700)]
mm: initialize pages on demand during boot
Deferred page initialization allows the boot cpu to initialize a small
subset of the system's pages early in boot, with other cpus doing the
rest later on.
It is, however, problematic to know how many pages the kernel needs
during boot. Different modules and kernel parameters may change the
requirement, so the boot cpu either initializes too many pages or runs
out of memory.
To fix that, initialize early pages on demand. This ensures the kernel
does the minimum amount of work to initialize pages during boot and
leaves the rest to be divided in the multithreaded initialization path
(deferred_init_memmap).
The on-demand code is permanently disabled using static branching once
deferred pages are initialized. After the static branch is changed to
false, the overhead is up-to two branch-always instructions if the zone
watermark check fails or if rmqueue fails.
Sergey Senozhatsky noticed that while deferred pages currently make
sense only on NUMA machines (we start one thread per latency node),
CONFIG_NUMA is not a requirement for CONFIG_DEFERRED_STRUCT_PAGE_INIT,
so that is also must be addressed in the patch.
[akpm@linux-foundation.org: fix typo in comment, make deferred_pages static]
[pasha.tatashin@oracle.com: fix min() type mismatch warning] Link: http://lkml.kernel.org/r/20180212164543.26592-1-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: use zone_to_nid() in deferred_grow_zone()] Link: http://lkml.kernel.org/r/20180214163343.21234-2-pasha.tatashin@oracle.com
[pasha.tatashin@oracle.com: might_sleep warning] Link: http://lkml.kernel.org/r/20180306192022.28289-1-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: s/spin_lock/spin_lock_irq/ in page_alloc_init_late()]
[pasha.tatashin@oracle.com: v5] Link: http://lkml.kernel.org/r/20180309220807.24961-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: tweak comments]
[pasha.tatashin@oracle.com: v6] Link: http://lkml.kernel.org/r/20180313182355.17669-3-pasha.tatashin@oracle.com
[akpm@linux-foundation.org: coding-style fixes] Link: http://lkml.kernel.org/r/20180209192216.20509-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com> Reviewed-by: Steven Sistare <steven.sistare@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Acked-by: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pavel Tatashin [Thu, 5 Apr 2018 23:22:27 +0000 (16:22 -0700)]
mm: disable interrupts while initializing deferred pages
Vlastimil Babka reported about a window issue during which when deferred
pages are initialized, and the current version of on-demand
initialization is finished, allocations may fail. While this is highly
unlikely scenario, since this kind of allocation request must be large,
and must come from interrupt handler, we still want to cover it.
We solve this by initializing deferred pages with interrupts disabled,
and holding node_size_lock spin lock while pages in the node are being
initialized. The on-demand deferred page initialization that comes
later will use the same lock, and thus synchronize with
deferred_init_memmap().
It is unlikely for threads that initialize deferred pages to be
interrupted. They run soon after smp_init(), but before modules are
initialized, and long before user space programs. This is why there is
no adverse effect of having these threads running with interrupts
disabled.
[pasha.tatashin@oracle.com: v6] Link: http://lkml.kernel.org/r/20180313182355.17669-2-pasha.tatashin@oracle.com Link: http://lkml.kernel.org/r/20180309220807.24961-2-pasha.tatashin@oracle.com Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Steven Sistare <steven.sistare@oracle.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: AKASHI Takahiro <takahiro.akashi@linaro.org> Cc: Gioh Kim <gi-oh.kim@profitbricks.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Paul Burton <paul.burton@mips.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Thu, 5 Apr 2018 23:22:23 +0000 (16:22 -0700)]
mm/swap_slots.c: use conditional compilation
For mm/swap_slots.c, use the traditional Linux method of conditional
compilation and linking instead of always compiling it by using #ifdef
CONFIG_SWAP and #endif for the entire source file (excluding header
files).
Link: http://lkml.kernel.org/r/c2a47015-0b5a-d0d9-8bc7-9984c049df20@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Acked-by: Tim Chen <tim.c.chen@linux.intel.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/migrate: rename migration reason MR_CMA to MR_CONTIG_RANGE
alloc_contig_range() initiates compaction and eventual migration for the
purpose of either CMA or HugeTLB allocations. At present, the reason
code remains the same MR_CMA for either of these cases. Let's make it
MR_CONTIG_RANGE which will appropriately reflect the reason code in both
these cases.
Link: http://lkml.kernel.org/r/20180202091518.18798-1-khandual@linux.vnet.ibm.com Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Colin Ian King [Thu, 5 Apr 2018 23:22:01 +0000 (16:22 -0700)]
mm/ksm.c: make stable_node_dup() static
stable_node_dup() is local to the source and does not need to be in
global scope, so make it static.
Cleans up sparse warning:
mm/ksm.c:1321:13: warning: symbol 'stable_node_dup' was not declared. Should it be static?
Link: http://lkml.kernel.org/r/20180206221005.12642-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The kasan quarantine is designed to delay freeing slab objects to catch
use-after-free. The quarantine can be large (several percent of machine
memory size). When kmem_caches are deleted related objects are flushed
from the quarantine but this requires scanning the entire quarantine
which can be very slow. We have seen the kernel busily working on this
while holding slab_mutex and badly affecting cache_reaper, slabinfo
readers and memcg kmem cache creations.
It can easily reproduced by following script:
yes . | head -1000000 | xargs stat > /dev/null
for i in `seq 1 10`; do
seq 500 | (cd /cg/memory && xargs mkdir)
seq 500 | xargs -I{} sh -c 'echo $BASHPID > \
/cg/memory/{}/tasks && exec stat .' > /dev/null
seq 500 | (cd /cg/memory && xargs rmdir)
done
This patch is based on the observation that if the kmem_cache to be
destroyed is empty then there should not be any objects of this cache in
the quarantine.
Without the patch the script got stuck for couple of hours. With the
patch the script completed within a second.
Link: http://lkml.kernel.org/r/20180327230603.54721-1-shakeelb@google.com Signed-off-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Greg Thelen <gthelen@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slab_common.c: remove test if cache name is accessible
Since commit 1b450a31cdeb ("mm/sl[aou]b: Move duping of slab name to
slab_common.c"), the kernel always duplicates the slab cache name when
creating a slab cache, so the test if the slab name is accessible is
useless.
Link: http://lkml.kernel.org/r/alpine.LRH.2.02.1803231133310.22626@file01.intranet.prod.int.rdu2.redhat.com Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
However for SLUB in debug kernel, the sizes were same. On further
inspection it is found that SLUB always use kmem_cache.object_size to
measure the kmem_cache.size while SLAB use the given kmem_cache.size.
In the debug kernel the slab's size can be larger than its object_size.
Thus in the creation of non-root slab, the SLAB uses the root's size as
base to calculate the non-root slab's size and thus non-root slab's size
can be larger than the root slab's size. For SLUB, the non-root slab's
size is measured based on the root's object_size and thus the size will
remain same for root and non-root slab.
This patch makes slab's object_size the default base to measure the
slab's size.
Link: http://lkml.kernel.org/r/20180313165428.58699-1-shakeelb@google.com Fixes: 9e12f82eb190 ("memcg, slab: separate memcg vs root cache creation paths") Signed-off-by: Shakeel Butt <shakeelb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If kmem case sizes are 32-bit, then usecopy region should be too.
Link: http://lkml.kernel.org/r/20180305200730.15812-21-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Miller <davem@davemloft.net> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
slab: make create_kmalloc_cache() work with 32-bit sizes
KMALLOC_MAX_CACHE_SIZE is 32-bit so is the largest kmalloc cache size.
Christoph said:
:
: Ok SLABs maximum allocation size is limited to 32M (see
: include/linux/slab.h:
:
: #define KMALLOC_SHIFT_HIGH ((MAX_ORDER + PAGE_SHIFT - 1) <= 25 ? \
: (MAX_ORDER + PAGE_SHIFT - 1) : 25)
:
: And SLUB/SLOB pass all larger requests to the page allocator anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-4-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch start a series of converting SLUB (mostly) to "unsigned int".
1) Most integers in the code are in fact unsigned entities: array
indexes, lengths, buffer sizes, allocation orders. It is therefore
better to use unsigned variables
2) Some integers in the code are either "size_t" or "unsigned long" for
no reason.
size_t usually comes from people trying to maintain type correctness
and figuring out that "sizeof" operator returns size_t or
memset/memcpy takes size_t so should everything passed to it.
However the number of 4GB+ objects in the kernel is very small. Most,
if not all, dynamically allocated objects with kmalloc() or
kmem_cache_create() aren't actually big. Maintaining wide types
doesn't do anything.
64-bit ops are bigger than 32-bit on our beloved x86_64,
so try to not use 64-bit where it isn't necessary
(read: everywhere where integers are integers not pointers)
3) in case of SLAB allocators, there are additional limitations
*) page->inuse, page->objects are only 16-/15-bit,
*) cache size was always 32-bit
*) slab orders are small, order 20 is needed to go 64-bit on x86_64
(PAGE_SIZE << order)
Basically everything is 32-bit except kmalloc(1ULL<<32) which gets
shortcut through page allocator.
Christoph said:
:
: That changes with large base page size on power and ARM64 f.e. but then
: we do not want to encourage larger allocations through slab anyways.
Link: http://lkml.kernel.org/r/20180305200730.15812-2-adobriyan@gmail.com Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mm/slub.c: use jitter-free reference while printing age
When SLUB_DEBUG catches some issues, it prints all the required debug
info. However, in a few cases where allocation and free of the object
has happened in a very short time, 'age' might be misleading. See the
example below:
=============================================================================
BUG kmalloc-256 (Tainted: G W O ): Poison overwritten
-----------------------------------------------------------------------------
...
INFO: Allocated in binder_transaction+0x4b0/0x2448 age=731 cpu=3 pid=5314
...
INFO: Freed in binder_free_transaction+0x2c/0x58 age=735 cpu=6 pid=2079
...
Object fffffff14956a870: 6b 6b 6b 6b 6b 6b 6b 6b 67 6b 6b 6b 6b 6b 6b a5 kkkkkkkkgkkkk
In this case, object got freed later but 'age' shows otherwise. This
could be because, while printing this info, we print allocation traces
first and free traces thereafter. In between, if we get schedule out or
jiffies increment, (jiffies - t->when) could become meaningless.
Use the jitter free reference to calculate age.
New output will exactly be same. 'age' is still staying with single
jiffies ref in both prints.
Change-Id: I0846565807a4229748649bbecb1ffb743d71fcd8 Link: http://lkml.kernel.org/r/1520492010-19389-1-git-send-email-cpandya@codeaurora.org Signed-off-by: Chintan Pandya <cpandya@codeaurora.org> Acked-by: Christoph Lameter <cl@linux.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
fs: don't flush pagecache when expanding block device
When changing the size of a block device, its all caches are freed.
It's necessary on shrinking to prevent spurious I/Os to the disappeared
region. However, on expanding, such kind of I/Os doesn't happen.
Similar things can be considered for btrfs filesystem resize and
resize2fs, but they are designed not to drop caches when expanding.
Therefore this patch removes unnecessary cache drop.
Link: http://lkml.kernel.org/r/1521457240-153390-1-git-send-email-shunki-fujita@cybozu.co.jp Signed-off-by: Shunki Fujita <shunki-fujita@cybozu.co.jp> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
net/9p/client.c: fix potential refcnt problem of trans module
When specifying trans_mod multiple times in a mount, it will cause an
inaccurate refcount of the trans module. Also, in the error case of
option parsing, we should put the trans module if we have already got
it.
Link: http://lkml.kernel.org/r/1522154942-57339-1-git-send-email-cgxu519@gmx.com Signed-off-by: Chengguang Xu <cgxu519@gmx.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: David Miller <davem@davemloft.net> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When the user uses some syscall, for example mmap(v9fs_file_mmap), it
will not update atime even if user's was set mnt_flags without
MNT_NOATIME, because v9fs defaults to settine SB_NOATIME in
v9fs_set_super.
For supporting access time updating when the user mounts with relatime,
we should not set SB_NOATIME by default.
Link: http://lkml.kernel.org/r/5AB9A377.6080906@huawei.com Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Greg Kurz <groug@kaod.org> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9p: don't maintain dir i_nlink if the exported fs doesn't either
If the exported filesystem dir on 9p server doesn't maintain accurate
i_nlink count, e.g. always reports i_nlink as 1, then 9p should not
maintain nlink count either, otherwise drop_link would report warning
with i_nlink being zero.
For example:
- overlayfs sets nlink to 1 for merged dir
- ext4 (with dir_nlink feature enabled) sets nlink to 1 if a dir has
more than EXT4_LINK_MAX (65000) links.
In this case, everytime a stat(2) call (getattr) on such exported dirs
on 9p client side, the i_nlink gets reset to 1, then operations like
rmdir(2), unlink(2) and rename(2) would cause the dir nlink to go to
zero (then negative), which results in warnings in drop_nlink() and/or
inc_nlink() calls.
This can be reproduced easily as the following steps:
- export a merged overlayfs dir via qemu virtfs to guest
- mount the exported virtfs in guest
- create two sub-directories in the root dir of the mounted 9pfs
- stat the root dir of 9pfs, this resets nlink to 1
- remove all subdirs, the second unlink/rmdir would trigger warning
Fix it by leaving i_nlink to be 1 and don't drop nlink if a directory
has nlink <= 2, which indicates that the underlying exported fs doesn't
maintain nlink count accurately. This follows what ext4 does in
ext4_dec_count().
Link: http://lkml.kernel.org/r/20180312053829.4367-1-eguan@linux.alibaba.com Signed-off-by: Eryu Guan <eguan@linux.alibaba.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Tested-by: Roman Kapl <code@rkapl.cz> Cc: Caspar Zhang <caspar@linux.alibaba.com> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: <v9fs-developer@lists.sourceforge.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Kurz [Thu, 5 Apr 2018 23:19:44 +0000 (16:19 -0700)]
net/9p: avoid -ERESTARTSYS leak to userspace
If it was interrupted by a signal, the 9p client may need to send some
more requests to the server for cleanup before returning to userspace.
To avoid such a last minute request to be interrupted right away, the
client memorizes if a signal is pending, clears TIF_SIGPENDING, handles
the request and calls recalc_sigpending() before returning.
Unfortunately, if the transmission of this cleanup request fails for any
reason, the transport returns an error and the client propagates it
right away, without calling recalc_sigpending().
This ends up with -ERESTARTSYS from the initially interrupted request
crawling up to syscall exit, with TIF_SIGPENDING cleared by the cleanup
request. The specific signal handling code, which is responsible for
converting -ERESTARTSYS to -EINTR is not called, and userspace receives
the confusing errno value:
open: Unknown error 512 (512)
This is really hard to hit in real life. I discovered the issue while
working on hot-unplug of a virtio-9p-pci device with an instrumented
QEMU allowing to control request completion.
Both p9_client_zc_rpc() and p9_client_rpc() functions have this buggy
error path actually. Their code flow is a bit obscure and the best
thing to do would probably be a full rewrite: to really ensure this
situation of clearing TIF_SIGPENDING and returning -ERESTARTSYS can
never happen.
But given the general lack of interest for the 9p code, I won't risk
breaking more things. So this patch simply fixes the buggy paths in
both functions with a trivial label+goto.
Thanks to Laurent Dufour for his help and suggestions on how to find the
root cause and how to fix it.
Link: http://lkml.kernel.org/r/152062809886.10599.7361006774123053312.stgit@bahia.lan Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Cc: Eric Van Hensbergen <ericvh@gmail.com> Cc: Ron Minnich <rminnich@sandia.gov> Cc: Latchesar Ionkov <lucho@ionkov.net> Cc: David Miller <davem@davemloft.net> Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 5 Apr 2018 23:19:40 +0000 (16:19 -0700)]
ocfs2/dlm: clean up unused variable in dlm_process_recovery_data
Link: http://lkml.kernel.org/r/1522734135-7933-1-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He [Thu, 5 Apr 2018 23:19:32 +0000 (16:19 -0700)]
ocfs2: add duplicated ino number check
Add duplicated ino number check, to avoid adding a file into the file
check list when this file is being checked.
Link: http://lkml.kernel.org/r/1495611866-27360-5-git-send-email-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He [Thu, 5 Apr 2018 23:19:29 +0000 (16:19 -0700)]
ocfs2: add kobject for online file check
Use embedded kobject mechanism for online file check feature, this will
avoid to use a global list to save/search per-device online file check
related data, meanwhile, reduce the code lines and make the code logic
clear. The changed code is based on Goldwyn Rodrigues's patches and
ext4 fs code.
Link: http://lkml.kernel.org/r/1495611866-27360-4-git-send-email-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He [Thu, 5 Apr 2018 23:19:25 +0000 (16:19 -0700)]
ocfs2: fix some small problems
First, move setting fe_done = 1 in spin lock, avoid bring any potential
race condition.
Second, tune mlog message level from ERROR to NOTICE, since the message
should not belong to error message.
Third, tune errno to -EAGAIN when file check queue is full, this errno
is more appropriate in the case.
Link: http://lkml.kernel.org/r/1495611866-27360-3-git-send-email-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Gang He [Thu, 5 Apr 2018 23:19:22 +0000 (16:19 -0700)]
ocfs2: move some definitions to header file
Patch series "ocfs2: use kobject for online file check", v3.
Use embedded kobject mechanism for online file check feature, this will
avoid to use a global list to save/search per-device online file check
related data. The changed code is based on Goldwyn Rodrigues's patches
and ext4 fs code, there is not any new features added, except some very
small fixes during this code refactoring. Second, the code change does
not affect the underlying file check code. Thank Goldwyn very much.
Compare with second version, add more comments in the patch
descriptions, to make sure each modification is mentioned. Compare with
first version, split the code change into four patches, make sure each
patch will not bring ocfs2 kernel modules compiling errors.
This patch (of 3):
Move some definitions to header file, which will be referenced by other
source files when kobject mechanism is introduced.
Link: http://lkml.kernel.org/r/1495611866-27360-2-git-send-email-ghe@suse.com Signed-off-by: Gang He <ghe@suse.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 5 Apr 2018 23:19:18 +0000 (16:19 -0700)]
ocfs2: correct spelling mistake for migratable for all
Inspired by the ocfs2 patch to fix the spelling of migrateable to
migratable, I checked all ocfs2 files and found more spelling mistakes.
So correct them all.
Link: http://lkml.kernel.org/r/1521525734-19576-1-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mfasheh@versity.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <joseph.qi@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Trivial fix to spelling mistake in mlog message text
Link: http://lkml.kernel.org/r/20180319114101.2051-1-colin.king@canonical.com Signed-off-by: Colin Ian King <colin.king@canonical.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
ocfs2/dlm: wait for dlm recovery done when migrating all lock resources
Wait for dlm recovery done when migrating all lock resources in case that
new lock resource left after leaving dlm domain. And the left lock
resource will cause other nodes BUG.
at last NodeA become the
master of the new lockres
and leave domain:
dlm_leave_domain()
mount:
dlm_join_domain()
touch file and request
for the owner of the new
lockres, but all the
other nodes said 'NO',
so NodeC decide to be
the owner, and send do
assert msg to other
nodes:
dlmlock()
dlm_get_lock_resource()
dlm_do_assert_master()
other nodes receive the msg
and found two masters exist.
at last cause BUG in
dlm_assert_master_handler()
-->BUG();
Link: http://lkml.kernel.org/r/5AAA6E25.7090303@huawei.com Fixes: 8c323894a39f ("dlm: allow dlm do recovery during shutdown") Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Alex Chen <alex.chen@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 5 Apr 2018 23:19:07 +0000 (16:19 -0700)]
ocfs2/dlm: clean up unused stack variable in dlm_do_local_ast()
Link: http://lkml.kernel.org/r/1521116681-14602-2-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 5 Apr 2018 23:19:03 +0000 (16:19 -0700)]
ocfs2/dlm: clean up unused argument for dlm_destroy_recovery_area()
Link: http://lkml.kernel.org/r/1521116681-14602-1-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Obviously, the comment before dlm_do_local_recovery_cleanup() has nothing
to do with it. So remove it.
Link: http://lkml.kernel.org/r/1519371054-4648-1-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Acked-by: Joseph Qi <jiangqi903@gmail.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Changwei Ge [Thu, 5 Apr 2018 23:18:56 +0000 (16:18 -0700)]
ocfs2: remove two unused functions from suballoc.c
The two functions are no longer used.
Link: http://lkml.kernel.org/r/1519609595-26229-1-git-send-email-ge.changwei@h3c.com Signed-off-by: Changwei Ge <ge.changwei@h3c.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Cc: Joseph Qi <jiangqi903@gmail.com> Cc: Changwei Ge <ge.changwei@h3c.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jun Piao [Thu, 5 Apr 2018 23:18:48 +0000 (16:18 -0700)]
ocfs2/dlm: don't handle migrate lockres if already in shutdown
We should not handle migrate lockres if we are already in
'DLM_CTXT_IN_SHUTDOWN', as that will cause lockres remains after leaving
dlm domain. At last other nodes will get stuck into infinite loop when
requsting lock from us.
The problem is caused by concurrency umount between nodes. Before
receiveing N1's DLM_BEGIN_EXIT_DOMAIN_MSG, N2 has picked up N1 as the
migrate target. So N2 will continue sending lockres to N1 even though
N1 has left domain.
N1 N2 (owner)
touch file
access the file,
and get pr lock
begin leave domain and
pick up N1 as new owner
begin leave domain and
migrate all lockres done
begin migrate lockres to N1
end leave domain, but
the lockres left
unexpectedly, because
migrate task has passed
[piaojun@huawei.com: v3] Link: http://lkml.kernel.org/r/5A9CBD19.5020107@huawei.com Link: http://lkml.kernel.org/r/5A99F028.2090902@huawei.com Signed-off-by: Jun Piao <piaojun@huawei.com> Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com> Reviewed-by: Joseph Qi <jiangqi903@gmail.com> Reviewed-by: Changwei Ge <ge.changwei@h3c.com> Cc: Mark Fasheh <mark@fasheh.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Junxiao Bi <junxiao.bi@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1521444205-2259-1-git-send-email-changbin.du@intel.com Signed-off-by: Changbin Du <changbin.du@intel.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: NeilBrown <neilb@suse.com> Cc: Richard Weinberger <richard@nod.at> Cc: Kate Stewart <kstewart@linuxfoundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Kravetz [Thu, 5 Apr 2018 23:18:21 +0000 (16:18 -0700)]
hugetlbfs: fix bug in pgoff overflow checking
This is a fix for a regression in 32 bit kernels caused by an invalid
check for pgoff overflow in hugetlbfs mmap setup. The check incorrectly
specified that the size of a loff_t was the same as the size of a long.
The regression prevents mapping hugetlbfs files at offsets greater than
4GB on 32 bit kernels.
On 32 bit kernels conversion from a page based unsigned long can not
overflow a loff_t byte offset. Therefore, skip this check if
sizeof(unsigned long) != sizeof(loff_t).
Link: http://lkml.kernel.org/r/20180330145402.5053-1-mike.kravetz@oracle.com Fixes: f9a7dee802e9 ("hugetlbfs: check for pgoff value overflow") Reported-by: Dan Rue <dan.rue@linaro.org> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Tested-by: Anders Roxell <anders.roxell@linaro.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Yisheng Xie <xieyisheng1@huawei.com> Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com> Cc: Nic Losby <blurbdust@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
zboot: fix stack protector in compressed boot phase
Calling __stack_chk_guard_setup() in decompress_kernel() is too late
that stack checking always fails for decompress_kernel() itself. So
remove __stack_chk_guard_setup() and initialize __stack_chk_guard before
we call decompress_kernel().
Original code comes from ARM but also used for MIPS and SH, so fix them
together. If without this fix, compressed booting of these archs will
fail because stack checking is enabled by default (>=4.16).
Link: http://lkml.kernel.org/r/1522226933-29317-1-git-send-email-chenhc@lemote.com Fixes: 7e840f91cd29 ("stackprotector: Introduce CONFIG_CC_STACKPROTECTOR_STRONG") Signed-off-by: Huacai Chen <chenhc@lemote.com> Acked-by: James Hogan <jhogan@kernel.org> Acked-by: Kees Cook <keescook@chromium.org> Acked-by: Rich Felker <dalias@libc.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Russell King <linux@arm.linux.org.uk> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Merge tag 'char-misc-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc updates from Greg KH:
"Here is the big set of char/misc driver patches for 4.17-rc1.
There are a lot of little things in here, nothing huge, but all
important to the different hardware types involved:
- thunderbolt driver updates
- parport updates (people still care...)
- nvmem driver updates
- mei updates (as always)
- hwtracing driver updates
- hyperv driver updates
- extcon driver updates
- ... and a handful of even smaller driver subsystem and individual
driver updates
All of these have been in linux-next with no reported issues"
* tag 'char-misc-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: (149 commits)
hwtracing: Add HW tracing support menu
intel_th: Add ACPI glue layer
intel_th: Allow forcing host mode through drvdata
intel_th: Pick up irq number from resources
intel_th: Don't touch switch routing in host mode
intel_th: Use correct method of finding hub
intel_th: Add SPDX GPL-2.0 header to replace GPLv2 boilerplate
stm class: Make dummy's master/channel ranges configurable
stm class: Add SPDX GPL-2.0 header to replace GPLv2 boilerplate
MAINTAINERS: Bestow upon myself the care for drivers/hwtracing
hv: add SPDX license id to Kconfig
hv: add SPDX license to trace
Drivers: hv: vmbus: do not mark HV_PCIE as perf_device
Drivers: hv: vmbus: respect what we get from hv_get_synint_state()
/dev/mem: Avoid overwriting "err" in read_mem()
eeprom: at24: use SPDX identifier instead of GPL boiler-plate
eeprom: at24: simplify the i2c functionality checking
eeprom: at24: fix a line break
eeprom: at24: tweak newlines
eeprom: at24: refactor at24_probe()
...
Merge tag 'driver-core-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core
Pull driver core updates from Greg KH:
"Here is the "big" set of driver core patches for 4.17-rc1.
There's really not much here, just a bunch of firmware code
refactoring from Luis as he attempts to wrangle that codebase into
something that is managable, along with a bunch of userspace tests for
it. Other than that, a handful of small bugfixes and reverts of things
that didn't work out.
Full details are in the shortlog, it's not all that much.
All of these have been in linux-next for a while with no reported
issues"
* tag 'driver-core-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (30 commits)
drivers: base: remove check for callback in coredump_store()
mt7601u: use firmware_request_cache() to address cache on reboot
firmware: add firmware_request_cache() to help with cache on reboot
firmware: fix typo on pr_info_once() when ignore_sysfs_fallback is used
firmware: explicitly include vmalloc.h
firmware: ensure the firmware cache is not used on incompatible calls
test_firmware: modify custom fallback tests to use unique files
firmware: add helper to check to see if fw cache is setup
firmware: fix checking for return values for fw_add_devm_name()
rename: _request_firmware_load() fw_load_sysfs_fallback()
test_firmware: test three firmware kernel configs using a proc knob
test_firmware: expand on library with shared helpers
firmware: enable to force disable the fallback mechanism at run time
firmware: enable run time change of forcing fallback loader
firmware: move firmware loader into its own directory
firmware: split firmware fallback functionality into its own file
firmware: move loading timeout under struct firmware_fallback_config
firmware: use helpers for setting up a temporary cache timeout
firmware: simplify CONFIG_FW_LOADER_USER_HELPER_FALLBACK further
drivers: base: add description for .coredump() callback
...
Merge tag 'staging-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO updates from Greg KH:
"Here is the big set of Staging/IIO driver patches for 4.17-rc1.
It is a lot, over 500 changes, but not huge by previous kernel release
standards. We deleted more lines than we added again (27k added vs.
91k remvoed), thanks to finally being able to delete the IRDA drivers
and networking code.
We also deleted the ccree crypto driver, but that's coming back in
through the crypto tree to you, in a much cleaned-up form.
Added this round is at lot of "mt7621" device support, which is for an
embedded device that Neil Brown cares about, and of course a handful
of new IIO drivers as well.
And finally, the fsl-mc core code moved out of the staging tree to the
"real" part of the kernel, which is nice to see happen as well.
Full details are in the shortlog, which has all of the tiny cleanup
patches described.
All of these have been in linux-next for a while with no reported
issues"
* tag 'staging-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging: (579 commits)
staging: rtl8723bs: Remove yield call, replace with cond_resched()
staging: rtl8723bs: Replace yield() call with cond_resched()
staging: rtl8723bs: Remove unecessary newlines from 'odm.h'.
staging: rtl8723bs: Rework 'struct _ODM_Phy_Status_Info_' coding style.
staging: rtl8723bs: Rework 'struct _ODM_Per_Pkt_Info_' coding style.
staging: rtl8723bs: Replace NULL pointer comparison with '!'.
staging: rtl8723bs: Factor out rtl8723bs_recv_tasklet() sections.
staging: rtl8723bs: Fix function signature that goes over 80 characters.
staging: rtl8723bs: Fix lines too long in update_recvframe_attrib().
staging: rtl8723bs: Remove unnecessary blank lines in 'rtl8723bs_recv.c'.
staging: rtl8723bs: Change camel case to snake case in 'rtl8723bs_recv.c'.
staging: rtl8723bs: Add missing braces in else statement.
staging: rtl8723bs: Add spaces around ternary operators.
staging: rtl8723bs: Fix lines with trailing open parentheses.
staging: rtl8723bs: Remove unnecessary length #define's.
staging: rtl8723bs: Fix IEEE80211 authentication algorithm constants.
staging: rtl8723bs: Fix alignment in rtw_wx_set_auth().
staging: rtl8723bs: Remove braces from single statement conditionals.
staging: rtl8723bs: Remove unecessary braces from switch statement.
staging: rtl8723bs: Fix newlines in rtw_wx_set_auth().
...
Merge tag 'tty-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial driver updates from Greg KH:
"Here is the big set of tty and serial driver patches for 4.17-rc1
Not all that big really, most are just small fixes and additions to
existing drivers. There's a bunch of work on the imx serial driver
recently for some reason, and a new embedded serial driver added as
well.
Full details are in the shortlog.
All of these have been in the linux-next tree for a while with no
reported issues"
* tag 'tty-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (66 commits)
serial: expose buf_overrun count through proc interface
serial: mvebu-uart: fix tx lost characters
tty: serial: msm_geni_serial: Fix return value check in qcom_geni_serial_probe()
tty: serial: msm_geni_serial: Add serial driver support for GENI based QUP
8250-men-mcb: add support for 16z025 and 16z057
powerpc: Mark the variable earlycon_acpi_spcr_enable maybe_unused
serial: stm32: fix initialization of RS485 mode
ARM: dts: STi: Remove "console=ttyASN" from bootargs for STi boards
vt: change SGR 21 to follow the standards
serdev: Fix typo in serdev_device_alloc
ARM: dts: STi: Fix aliases property name for STi boards
tty: st-asc: Update tty alias
serial: stm32: add support for RS485 hardware control mode
dt-bindings: serial: stm32: add RS485 optional properties
selftests: add devpts selftests
devpts: comment devpts_mntget()
devpts: resolve devpts bind-mounts
devpts: hoist out check for DEVPTS_SUPER_MAGIC
serial: 8250: Add Nuvoton NPCM UART
serial: mxs-auart: disable clks of Alphascale ASM9260
...
Merge tag 'usb-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB/PHY updates from Greg KH:
"Here is the big set of USB and PHY driver patches for 4.17-rc1.
Lots of USB typeC work happened this round, with code moving from the
staging directory into the "real" part of the kernel, as well as new
infrastructure being added to be able to handle the different types of
"roles" that typeC requires.
There is also the normal huge set of USB gadget controller and driver
updates, along with XHCI changes, and a raft of other tiny fixes all
over the USB tree. And the PHY driver updates are merged in here as
well as they interacted with the USB drivers in some places.
All of these have been in linux-next for a while with no reported
issues"
* tag 'usb-4.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: (250 commits)
Revert "USB: serial: ftdi_sio: add Id for Physik Instrumente E-870"
usb: musb: gadget: misplaced out of bounds check
usb: chipidea: imx: Fix ULPI on imx53
usb: chipidea: imx: Cleanup ci_hdrc_imx_platform_flag
usb: chipidea: usbmisc: small clean up
usb: chipidea: usbmisc: evdo can be set e/o reset
usb: chipidea: usbmisc: evdo is only specific to OTG port
USB: serial: ftdi_sio: add Id for Physik Instrumente E-870
usb: dwc3: gadget: never call ->complete() from ->ep_queue()
usb: gadget: udc: core: update usb_ep_queue() documentation
usb: host: Remove the deprecated ATH79 USB host config options
usb: roles: Fix return value check in intel_xhci_usb_probe()
USB: gadget: f_midi: fixing a possible double-free in f_midi
usb: core: Add USB_QUIRK_DELAY_CTRL_MSG to usbcore quirks
usb: core: Copy parameter string correctly and remove superfluous null check
USB: announce bcdDevice as well as idVendor, idProduct.
USB:fix USB3 devices behind USB3 hubs not resuming at hibernate thaw
usb: hub: Reduce warning to notice on power loss
USB: serial: ftdi_sio: add support for Harman FirmwareHubEmulator
USB: serial: cp210x: add ELDAT Easywave RX09 id
...
Pull networking fixes from David Miller:
"This fixes some fallout from the net-next merge the other day, plus
some non-merge-window-related bug fixes:
1) Fix sparse warnings in bcmgenet, systemport, b53, and mt7530
(Florian Fainelli)
2) pptp does a bogus dst_release() on a route we have a single
refcount on, and attached to a socket, which needs that refcount
(Eric Dumazet)
3) UDP connected sockets on ipv6 can race with route update handling,
resulting in a pre-PMTU update route still stuck on the socket and
thus continuing to get ICMPV6_PKT_TOOBIG errors. We end up never
seeing the updated route. (Alexey Kodanev)
4) Missing list initializer(s) in TIPC (Jon Maloy)
5) Connect phy early to prevent crashes in lan78xx driver (Alexander
Graf)
6) Fix build with modular NVMEM (Arnd Bergmann)
7) netdevsim canot mark nsim_devlink_net_ops and nsim_fib_net_ops as
__net_initdata, as these are references from module unload
unconditionally (Arnd Bergmann)"
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (23 commits)
netdevsim: remove incorrect __net_initdata annotations
sfc: remove ctpio_dmabuf_start from stats
inet: frags: fix ip6frag_low_thresh boundary
tipc: Fix namespace violation in tipc_sk_fill_sock_diag
net: avoid unneeded atomic operation in ip*_append_data()
nvmem: disallow modular CONFIG_NVMEM
net: hns3: fix length overflow when CONFIG_ARM64_64K_PAGES
nfp: use full 40 bits of the NSP buffer address
lan78xx: Connect phy early
nfp: add a separate counter for packets with CHECKSUM_COMPLETE
tipc: Fix missing list initializations in struct tipc_subscription
ipv6: udp: set dst cache for a connected sk if current not valid
ipv6: udp: convert 'connected' to bool type in udpv6_sendmsg()
ipv6: allow to cache dst for a connected sk in ip6_sk_dst_lookup_flow()
ipv6: add a wrapper for ip6_dst_store() with flowi6 checks
net: phy: marvell10g: add thermal hwmon device
pptp: remove a buggy dst release in pptp_connect()
net: dsa: mt7530: Use NULL instead of plain integer
net: dsa: b53: Fix sparse warnings in b53_mmap.c
af_unix: remove redundant lockdep class
...
Merge tag 'riscv-for-linus-4.17-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux
Pull RISC-V updates from Palmer Dabbelt:
"This contains the new features we'd like to incorporate into the
RISC-V port for 4.17. We might have a bit more stuff land later in the
merge window, but I wanted to get this out earlier just so everyone
can see where we currently stand.
A short summary of the changes is:
- We've added support for dynamic ftrace on RISC-V targets.
- There have been a handful of cleanups to our atomic and locking
routines. They now more closely match the released RISC-V memory
model draft.
- Our module loading support has been cleaned up and is now enabled
by default, despite some limitations still existing.
- A patch to define COMMANDLINE_FORCE instead of COMMANDLINE_OVERRIDE
so the generic device tree code picks up handling all our command
line stuff.
There's more information in the merge commits for each patch set"
* tag 'riscv-for-linus-4.17-mw0' of git://git.kernel.org/pub/scm/linux/kernel/git/palmer/riscv-linux: (21 commits)
RISC-V: Rename CONFIG_CMDLINE_OVERRIDE to CONFIG_CMDLINE_FORCE
RISC-V: Add definition of relocation types
RISC-V: Enable module support in defconfig
RISC-V: Support SUB32 relocation type in kernel module
RISC-V: Support ADD32 relocation type in kernel module
RISC-V: Support ALIGN relocation type in kernel module
RISC-V: Support RVC_BRANCH/JUMP relocation type in kernel modulewq
RISC-V: Support HI20/LO12_I/LO12_S relocation type in kernel module
RISC-V: Support CALL relocation type in kernel module
RISC-V: Support GOT_HI20/CALL_PLT relocation type in kernel module
RISC-V: Add section of GOT.PLT for kernel module
RISC-V: Add sections of PLT and GOT for kernel module
riscv/atomic: Strengthen implementations with fences
riscv/spinlock: Strengthen implementations with fences
riscv/barrier: Define __smp_{store_release,load_acquire}
riscv/ftrace: Add HAVE_FUNCTION_GRAPH_RET_ADDR_PTR support
riscv/ftrace: Add DYNAMIC_FTRACE_WITH_REGS support
riscv/ftrace: Add ARCH_SUPPORTS_FTRACE_OPS support
riscv/ftrace: Add dynamic function graph tracer support
riscv/ftrace: Add dynamic function tracer support
...
Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 updates from Will Deacon:
"Nothing particularly stands out here, probably because people were
tied up with spectre/meltdown stuff last time around. Still, the main
pieces are:
- Rework of our CPU features framework so that we can whitelist CPUs
that don't require kpti even in a heterogeneous system
- Support for the IDC/DIC architecture extensions, which allow us to
elide instruction and data cache maintenance when writing out
instructions
- Removal of the large memory model which resulted in suboptimal
codegen by the compiler and increased the use of literal pools,
which could potentially be used as ROP gadgets since they are
mapped as executable
- Rework of forced signal delivery so that the siginfo_t is
well-formed and handling of show_unhandled_signals is consolidated
and made consistent between different fault types
- More siginfo cleanup based on the initial patches from Eric
Biederman
- Some small ACPI IORT updates and cleanups from Lorenzo Pieralisi
- Misc cleanups and non-critical fixes"
* tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (70 commits)
arm64: uaccess: Fix omissions from usercopy whitelist
arm64: fpsimd: Split cpu field out from struct fpsimd_state
arm64: tlbflush: avoid writing RES0 bits
arm64: cmpxchg: Include linux/compiler.h in asm/cmpxchg.h
arm64: move percpu cmpxchg implementation from cmpxchg.h to percpu.h
arm64: cmpxchg: Include build_bug.h instead of bug.h for BUILD_BUG
arm64: lse: Include compiler_types.h and export.h for out-of-line LL/SC
arm64: fpsimd: include <linux/init.h> in fpsimd.h
drivers/perf: arm_pmu_platform: do not warn about affinity on uniprocessor
perf: arm_spe: include linux/vmalloc.h for vmap()
Revert "arm64: Revert L1_CACHE_SHIFT back to 6 (64-byte cache line size)"
arm64: cpufeature: Avoid warnings due to unused symbols
arm64: Add work around for Arm Cortex-A55 Erratum 1024718
arm64: Delay enabling hardware DBM feature
arm64: Add MIDR encoding for Arm Cortex-A55 and Cortex-A35
arm64: capabilities: Handle shared entries
arm64: capabilities: Add support for checks based on a list of MIDRs
arm64: Add helpers for checking CPU MIDR against a range
arm64: capabilities: Clean up midr range helpers
arm64: capabilities: Change scope of VHE to Boot CPU feature
...
Merge branch 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq updates from Thomas Gleixner:
"The usual pile of boring changes:
- Consolidate tasklet functions to share code instead of duplicating
it
- The first step for making the low level entry handler management on
multi-platform kernels generic
- A new sysfs file which allows to retrieve the wakeup state of
interrupts.
- Ensure that the interrupt thread follows the effective affinity and
not the programmed affinity to avoid cross core wakeups.
- Two new interrupt controller drivers (Microsemi Ocelot and Qualcomm
PDC)
- Fix the wakeup path clock handling for Reneasas interrupt chips.
- Rework the boot time register reset for ARM GIC-V2/3
- Better suspend/resume support for ARM GIV-V3/ITS
- Add missing locking to the ARM GIC set_type() callback
- Small fixes for the irq simulator code
- SPDX identifiers for the irq core code and removal of boiler plate
- Small cleanups all over the place"
* 'irq-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
openrisc: Set CONFIG_MULTI_IRQ_HANDLER
arm64: Set CONFIG_MULTI_IRQ_HANDLER
genirq: Make GENERIC_IRQ_MULTI_HANDLER depend on !MULTI_IRQ_HANDLER
irqchip/gic: Take lock when updating irq type
irqchip/gic: Update supports_deactivate static key to modern api
irqchip/gic-v3: Ensure GICR_CTLR.EnableLPI=0 is observed before enabling
irqchip: Add a driver for the Microsemi Ocelot controller
dt-bindings: interrupt-controller: Add binding for the Microsemi Ocelot interrupt controller
irqchip/gic-v3: Probe for SCR_EL3 being clear before resetting AP0Rn
irqchip/gic-v3: Don't try to reset AP0Rn
irqchip/gic-v3: Do not check trigger configuration of partitionned LPIs
genirq: Remove license boilerplate/references
genirq: Add missing SPDX identifiers
genirq/matrix: Cleanup SPDX identifier
genirq: Cleanup top of file comments
genirq: Pass desc to __irq_free instead of irq number
irqchip/gic-v3: Loudly complain about the use of IRQ_TYPE_NONE
irqchip/gic: Loudly complain about the use of IRQ_TYPE_NONE
RISC-V: Move to the new GENERIC_IRQ_MULTI_HANDLER handler
genirq: Add CONFIG_GENERIC_IRQ_MULTI_HANDLER
...
Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull time(r) updates from Thomas Gleixner:
"A small set of updates for timers and timekeeping:
- The most interesting change is the consolidation of clock MONOTONIC
and clock BOOTTIME.
Clock MONOTONIC behaves now exactly like clock BOOTTIME and does
not longer ignore the time spent in suspend. A new clock
MONOTONIC_ACTIVE is provived which behaves like clock MONOTONIC in
kernels before this change. This allows applications to
programmatically check for the clock MONOTONIC behaviour.
As discussed in the review thread, this has the potential of
breaking user space and we might have to revert this. Knock on wood
that we can avoid that exercise.
- Updates to the NTP mechanism to improve accuracy
- A new kernel internal data structure to aid the ongoing Y2038 work.
- Cleanups and simplifications of the clocksource code.
- Make the alarmtimer code play nicely with debugobjects"
* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
alarmtimer: Init nanosleep alarm timer on stack
y2038: Introduce struct __kernel_old_timeval
tracing: Unify the "boot" and "mono" tracing clocks
hrtimer: Unify MONOTONIC and BOOTTIME clock behavior
posix-timers: Unify MONOTONIC and BOOTTIME clock behavior
timekeeping: Remove boot time specific code
Input: Evdev - unify MONOTONIC and BOOTTIME clock behavior
timekeeping: Make the MONOTONIC clock behave like the BOOTTIME clock
timekeeping: Add the new CLOCK_MONOTONIC_ACTIVE clock
timekeeping/ntp: Determine the multiplier directly from NTP tick length
timekeeping/ntp: Don't align NTP frequency adjustments to ticks
clocksource: Use ATTRIBUTE_GROUPS
clocksource: Use DEVICE_ATTR_RW/RO/WO to define device attributes
clocksource: Don't walk the clocksource list for empty override