Qian Cai [Sat, 15 Aug 2020 00:31:50 +0000 (17:31 -0700)]
mm/swap.c: annotate data races for lru_rotate_pvecs
A read of lru_add_pvec->nr can be interrupted, and the interrupt handler
can then write to the same variable. The write runs with local interrupts
disabled, but the plain reads still result in data races. However, it is
unlikely that the compiler could do much damage here, given that
lru_add_pvec->nr is an "unsigned char" and there is an existing compiler
barrier. Thus, annotate the reads using the data_race() macro. The data
races were reported by KCSAN,
BUG: KCSAN: data-race in lru_add_drain_cpu / rotate_reclaimable_page
write to 0xffff9291ebcb8a40 of 1 bytes by interrupt on cpu 23:
rotate_reclaimable_page+0x2df/0x490
pagevec_add at include/linux/pagevec.h:81
(inlined by) rotate_reclaimable_page at mm/swap.c:259
end_page_writeback+0x1b5/0x2b0
end_swap_bio_write+0x1d0/0x280
bio_endio+0x297/0x560
dec_pending+0x218/0x430 [dm_mod]
clone_endio+0xe4/0x2c0 [dm_mod]
bio_endio+0x297/0x560
blk_update_request+0x201/0x920
scsi_end_request+0x6b/0x4a0
scsi_io_completion+0xb7/0x7e0
scsi_finish_command+0x1ed/0x2a0
scsi_softirq_done+0x1c9/0x1d0
blk_done_softirq+0x181/0x1d0
__do_softirq+0xd9/0x57c
irq_exit+0xa2/0xc0
do_IRQ+0x8b/0x190
ret_from_intr+0x0/0x42
delay_tsc+0x46/0x80
__const_udelay+0x3c/0x40
__udelay+0x10/0x20
kcsan_setup_watchpoint+0x202/0x3a0
__tsan_read1+0xc2/0x100
lru_add_drain_cpu+0xb8/0x3f0
lru_add_drain+0x25/0x40
shrink_active_list+0xe1/0xc80
shrink_lruvec+0x766/0xb70
shrink_node+0x2d6/0xca0
do_try_to_free_pages+0x1f7/0x9a0
try_to_free_pages+0x252/0x5b0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x16e/0x6f0
__handle_mm_fault+0xcd5/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9291ebcb8a40 of 1 bytes by task 37761 on cpu 23:
lru_add_drain_cpu+0xb8/0x3f0
lru_add_drain_cpu at mm/swap.c:602
lru_add_drain+0x25/0x40
shrink_active_list+0xe1/0xc80
shrink_lruvec+0x766/0xb70
shrink_node+0x2d6/0xca0
do_try_to_free_pages+0x1f7/0x9a0
try_to_free_pages+0x252/0x5b0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x16e/0x6f0
__handle_mm_fault+0xcd5/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
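As a rough illustration of the annotation described in this commit, the
sketch below wraps a one-byte plain read in data_race(). It is a userspace
model, not the mm/swap.c code: the data_race() definition is a simplified
stand-in (the kernel version also tells KCSAN to ignore the access), and
struct pagevec_sketch and pvec_has_pages() are hypothetical names.

```c
#include <stdbool.h>

/* Simplified stand-in: evaluates the expression unchanged. */
#define data_race(expr) (expr)

/* Hypothetical miniature of a per-CPU pagevec; only the counter is modeled. */
struct pagevec_sketch {
	unsigned char nr;	/* the writer is assumed to run with interrupts off */
};

/* Task-context reader: a one-byte plain read, marked as an intentional race. */
static bool pvec_has_pages(struct pagevec_sketch *pvec)
{
	return data_race(pvec->nr) != 0;
}
```

The annotation changes nothing about the generated load in this model; its
purpose is to document that the race is known and tolerated.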
Qian Cai [Sat, 15 Aug 2020 00:31:47 +0000 (17:31 -0700)]
mm/rmap: annotate a data race at tlb_flush_batched
mm->tlb_flush_batched could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one
write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
try_to_unmap_one+0x59a/0x1ab0
set_tlb_ubc_flush_pending at mm/rmap.c:635
(inlined by) try_to_unmap_one at mm/rmap.c:1538
rmap_walk_anon+0x296/0x650
rmap_walk+0xdf/0x100
try_to_unmap+0x18a/0x2f0
shrink_page_list+0xef6/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
balance_pgdat+0x652/0xd90
kswapd+0x396/0x8d0
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
flush_tlb_batched_pending+0x29/0x90
flush_tlb_batched_pending at mm/rmap.c:682
change_p4d_range+0x5dd/0x1030
change_pte_range at mm/mprotect.c:44
(inlined by) change_pmd_range at mm/mprotect.c:212
(inlined by) change_pud_range at mm/mprotect.c:240
(inlined by) change_p4d_range at mm/mprotect.c:260
change_protection+0x222/0x310
change_prot_numa+0x3e/0x60
task_numa_work+0x219/0x350
task_work_run+0xed/0x140
prepare_exit_to_usermode+0x2cc/0x2e0
ret_from_intr+0x32/0x42
Reported by Kernel Concurrency Sanitizer on:
CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
flush_tlb_batched_pending() runs under the PTL but the write does not.
However, mm->tlb_flush_batched is only a bool, so the value is unlikely to
be shattered. Thus, mark it as an intentional data race by using the
data_race() macro.
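A minimal userspace sketch of the pattern described above, assuming a
simplified data_race() stand-in and hypothetical names (struct mm_sketch,
set_flush_pending(), flush_if_pending()); the flush itself is replaced by
a counter:

```c
#include <stdbool.h>

/* Simplified stand-in for the kernel macro. */
#define data_race(expr) (expr)

struct mm_sketch {
	bool tlb_flush_batched;
	unsigned int flush_count;	/* stands in for the real TLB flush */
};

/* Writer: in the reported trace this runs without the page table lock. */
static void set_flush_pending(struct mm_sketch *mm)
{
	mm->tlb_flush_batched = true;
}

/* Reader: checks the bool with an annotated plain read, then clears it. */
static bool flush_if_pending(struct mm_sketch *mm)
{
	if (data_race(mm->tlb_flush_batched)) {
		mm->flush_count++;
		mm->tlb_flush_batched = false;
		return true;
	}
	return false;
}
```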
Qian Cai [Sat, 15 Aug 2020 00:31:44 +0000 (17:31 -0700)]
mm/mempool: fix a data race in mempool_free()
mempool_t pool.curr_nr could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in mempool_free / remove_element
write to 0xffffffffa937638c of 4 bytes by task 6359 on cpu 113:
remove_element+0x4a/0x1c0
remove_element at mm/mempool.c:132
mempool_alloc+0x102/0x210
(inlined by) mempool_alloc at mm/mempool.c:399
bio_alloc_bioset+0x106/0x2c0
get_swap_bio+0x49/0x230
__swap_writepage+0x680/0xc30
swap_writepage+0x9c/0xf0
pageout+0x33e/0xae0
shrink_page_list+0x1f57/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
<snip>
read to 0xffffffffa937638c of 4 bytes by interrupt on cpu 64:
mempool_free+0x3e/0x150
mempool_free at mm/mempool.c:492
bio_free+0x192/0x280
bio_put+0x91/0xd0
end_swap_bio_write+0x1d8/0x280
bio_endio+0x2c2/0x5b0
dec_pending+0x22b/0x440 [dm_mod]
clone_endio+0xe4/0x2c0 [dm_mod]
bio_endio+0x2c2/0x5b0
blk_update_request+0x217/0x940
scsi_end_request+0x6b/0x4d0
scsi_io_completion+0xb7/0x7e0
scsi_finish_command+0x223/0x310
scsi_softirq_done+0x1d5/0x210
blk_mq_complete_request+0x224/0x250
scsi_mq_done+0xc2/0x250
pqi_raid_io_complete+0x5a/0x70 [smartpqi]
pqi_irq_handler+0x150/0x1410 [smartpqi]
__handle_irq_event_percpu+0x90/0x540
handle_irq_event_percpu+0x49/0xd0
handle_irq_event+0x85/0xca
handle_edge_irq+0x13f/0x3e0
do_IRQ+0x86/0x190
<snip>
The write is under pool->lock, but the read is lockless. Even though
commit 3fb764f0d149 ("mempool: fix and document synchronization and memory
barrier usage") introduced the smp_wmb() and smp_rmb() pair to improve the
situation, it is inadequate to protect the read from data races which
could lead to a logic bug, so fix it by adding READ_ONCE() for the read.
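The lockless check can be sketched in userspace as follows. The
READ_ONCE() definition is a simplified volatile-cast stand-in that relies
on the GCC/Clang __typeof__ extension, and struct mempool_sketch and
pool_below_min() are hypothetical names rather than the mm/mempool.c code.

```c
#include <stdbool.h>

/* Simplified stand-in: forces a single volatile load of x. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct mempool_sketch {
	int min_nr;	/* number of elements the pool wants to keep in reserve */
	int curr_nr;	/* written under the pool lock by other contexts */
};

/* Lockless check: load curr_nr exactly once before comparing it. */
static bool pool_below_min(struct mempool_sketch *pool)
{
	return READ_ONCE(pool->curr_nr) < pool->min_nr;
}
```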
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Link: http://lkml.kernel.org/r/1581446384-2131-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 15 Aug 2020 00:31:41 +0000 (17:31 -0700)]
mm/list_lru: fix a data race in list_lru_count_one
struct list_lru_one l.nr_items could be accessed concurrently as noticed
by KCSAN,
BUG: KCSAN: data-race in list_lru_count_one / list_lru_isolate_move
write to 0xffffa102789c4510 of 8 bytes by task 823 on cpu 39:
list_lru_isolate_move+0xf9/0x130
list_lru_isolate_move at mm/list_lru.c:180
inode_lru_isolate+0x12b/0x2a0
__list_lru_walk_one+0x122/0x3d0
list_lru_walk_one+0x75/0xa0
prune_icache_sb+0x8b/0xc0
super_cache_scan+0x1b8/0x250
do_shrink_slab+0x256/0x6d0
shrink_slab+0x41b/0x4a0
shrink_node+0x35c/0xd80
balance_pgdat+0x652/0xd90
kswapd+0x396/0x8d0
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
read to 0xffffa102789c4510 of 8 bytes by task 6345 on cpu 56:
list_lru_count_one+0x116/0x2f0
list_lru_count_one at mm/list_lru.c:193
super_cache_count+0xe8/0x170
do_shrink_slab+0x95/0x6d0
shrink_slab+0x41b/0x4a0
shrink_node+0x35c/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 56 PID: 6345 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #4
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
A shattered l.nr_items could affect the shrinker behaviour due to a data
race. Fix it by adding READ_ONCE() for the read. Since the writes are
aligned and up to word-size, assume those are safe from data races to
avoid readability issues of writing WRITE_ONCE(var, var + val).
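The asymmetry described above (marked read, plain write) might look like
this in a userspace sketch. READ_ONCE() is a simplified stand-in using the
GCC/Clang __typeof__ extension, and the struct and function names are
hypothetical.

```c
/* Simplified stand-in: forces a single volatile load of x. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

struct list_lru_one_sketch {
	long nr_items;
};

/* Writer, assumed to hold the LRU lock: a plain, aligned, word-size update. */
static void lru_add_items(struct list_lru_one_sketch *l, long val)
{
	l->nr_items += val;	/* rather than WRITE_ONCE(var, var + val) */
}

/* Lockless reader: one marked load of the counter. */
static long lru_count(struct list_lru_one_sketch *l)
{
	return READ_ONCE(l->nr_items);
}
```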
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Link: http://lkml.kernel.org/r/1581114679-5488-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 15 Aug 2020 00:31:37 +0000 (17:31 -0700)]
mm/memcontrol: fix a data race in scan count
struct mem_cgroup_per_node mz.lru_zone_size[zone_idx][lru] could be
accessed concurrently as noticed by KCSAN,
BUG: KCSAN: data-race in lruvec_lru_size / mem_cgroup_update_lru_size
write to 0xffff9c804ca285f8 of 8 bytes by task 50951 on cpu 12:
mem_cgroup_update_lru_size+0x11c/0x1d0
mem_cgroup_update_lru_size at mm/memcontrol.c:1266
isolate_lru_pages+0x6a9/0xf30
shrink_active_list+0x123/0xcc0
shrink_lruvec+0x8fd/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9c804ca285f8 of 8 bytes by task 50964 on cpu 95:
lruvec_lru_size+0xbb/0x270
mem_cgroup_get_zone_lru_size at include/linux/memcontrol.h:536
(inlined by) lruvec_lru_size at mm/vmscan.c:326
shrink_lruvec+0x1d0/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_current+0xa6/0x120
alloc_slab_page+0x3b1/0x540
allocate_slab+0x70/0x660
new_slab+0x46/0x70
___slab_alloc+0x4ad/0x7d0
__slab_alloc+0x43/0x70
kmem_cache_alloc+0x2c3/0x420
getname_flags+0x4c/0x230
getname+0x22/0x30
do_sys_openat2+0x205/0x3b0
do_sys_open+0x9a/0xf0
__x64_sys_openat+0x62/0x80
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
Reported by Kernel Concurrency Sanitizer on:
CPU: 95 PID: 50964 Comm: cc1 Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
The write is under lru_lock, but the read is done as lockless. The scan
count is used to determine how aggressively the anon and file LRU lists
should be scanned. Load tearing could generate an inefficient heuristic,
so fix it by adding READ_ONCE() for the read.
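A userspace sketch of a lockless size read, assuming a simplified
READ_ONCE() stand-in (GCC/Clang __typeof__) and hypothetical dimensions
and names; it sums per-zone sizes for one LRU list, marking each load:

```c
/* Simplified stand-in: forces a single volatile load of x. */
#define READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

#define NR_ZONES_SKETCH	3	/* hypothetical zone count */
#define NR_LRU_SKETCH	5	/* hypothetical number of LRU lists */

struct per_node_sketch {
	unsigned long lru_zone_size[NR_ZONES_SKETCH][NR_LRU_SKETCH];
};

/* Sum the sizes of one LRU list over zones 0..max_zone_idx without a lock. */
static unsigned long lru_size_up_to(struct per_node_sketch *mz, int lru,
				    int max_zone_idx)
{
	unsigned long size = 0;

	for (int zid = 0; zid <= max_zone_idx; zid++)
		size += READ_ONCE(mz->lru_zone_size[zid][lru]);
	return size;
}
```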
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Link: http://lkml.kernel.org/r/20200206034945.2481-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 15 Aug 2020 00:31:34 +0000 (17:31 -0700)]
mm/page_counter: fix various data races at memsw
Commit 7812d2571b5f ("mm: memcontrol: lockless page counters") made it
possible for memcg->memsw->watermark and memcg->memsw->failcnt to be
accessed concurrently, as reported by KCSAN,
BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
read to 0xffff8fb18c4cd190 of 8 bytes by task 1081 on cpu 59:
page_counter_try_charge+0x4d/0x150 mm/page_counter.c:138
try_charge+0x131/0xd50 mm/memcontrol.c:2405
__memcg_kmem_charge_memcg+0x58/0x140
__memcg_kmem_charge+0xcc/0x280
__alloc_pages_nodemask+0x1e1/0x450
alloc_pages_current+0xa6/0x120
pte_alloc_one+0x17/0xd0
__pte_alloc+0x3a/0x1f0
copy_p4d_range+0xc36/0x1990
copy_page_range+0x21d/0x360
dup_mmap+0x5f5/0x7a0
dup_mm+0xa2/0x240
copy_process+0x1b3f/0x3460
_do_fork+0xaa/0xa20
__x64_sys_clone+0x13b/0x170
do_syscall_64+0x91/0xb47
entry_SYSCALL_64_after_hwframe+0x49/0xbe
write to 0xffff8fb18c4cd190 of 8 bytes by task 1153 on cpu 120:
page_counter_try_charge+0x5b/0x150 mm/page_counter.c:139
try_charge+0x131/0xd50 mm/memcontrol.c:2405
mem_cgroup_try_charge+0x159/0x460
mem_cgroup_try_charge_delay+0x3d/0xa0
wp_page_copy+0x14d/0x930
do_wp_page+0x107/0x7b0
__handle_mm_fault+0xce6/0xd40
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
BUG: KCSAN: data-race in page_counter_try_charge / page_counter_try_charge
write to 0xffff88809bbf2158 of 8 bytes by task 11782 on cpu 0:
page_counter_try_charge+0x100/0x170 mm/page_counter.c:129
try_charge+0x185/0xbf0 mm/memcontrol.c:2405
__memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
__memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
__alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
read to 0xffff88809bbf2158 of 8 bytes by task 11814 on cpu 1:
page_counter_try_charge+0xef/0x170 mm/page_counter.c:129
try_charge+0x185/0xbf0 mm/memcontrol.c:2405
__memcg_kmem_charge_memcg+0x4a/0xe0 mm/memcontrol.c:2837
__memcg_kmem_charge+0xcf/0x1b0 mm/memcontrol.c:2877
__alloc_pages_nodemask+0x26c/0x310 mm/page_alloc.c:4780
Since watermark could be compared or set to garbage due to a data race
which would change the code logic, fix it by adding a pair of READ_ONCE()
and WRITE_ONCE() in those places.
The "failcnt" counter is tolerant of some degree of inaccuracy and is only
used to report stats, so a data race will not be harmful. Thus, mark it as
an intentional data race using the data_race() macro.
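The two treatments described above can be contrasted in a small userspace
sketch. The macros are simplified stand-ins (volatile casts via the
GCC/Clang __typeof__ extension, and a pass-through data_race()), and the
struct and function names are hypothetical.

```c
/* Simplified stand-ins for the kernel macros. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))
#define data_race(expr)		(expr)

struct page_counter_sketch {
	unsigned long watermark;	/* feeds into logic: marked accesses */
	unsigned long failcnt;		/* statistics only: annotated race */
};

/* Raise the watermark if the new usage exceeds the value seen by one load. */
static void update_watermark(struct page_counter_sketch *c, unsigned long usage)
{
	if (usage > READ_ONCE(c->watermark))
		WRITE_ONCE(c->watermark, usage);
}

/* Count a failed charge; a lost increment is assumed to be acceptable. */
static void note_failure(struct page_counter_sketch *c)
{
	data_race(c->failcnt++);
}
```

The marked accesses do not make the compare-and-store atomic; they only
ensure each side is a single untorn access.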
Fixes: 7812d2571b5f ("mm: memcontrol: lockless page counters") Reported-by: syzbot+f36cfe60b1006a94f9dc@syzkaller.appspotmail.com Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: David Hildenbrand <david@redhat.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Link: http://lkml.kernel.org/r/1581519682-23594-1-git-send-email-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 15 Aug 2020 00:31:31 +0000 (17:31 -0700)]
mm/swapfile: fix and annotate various data races
swap_info_struct si.highest_bit, si.swap_map[offset] and si.flags could
each be accessed concurrently, as noticed by KCSAN,
=== si.highest_bit ===
write to 0xffff8d5abccdc4d4 of 4 bytes by task 5353 on cpu 24:
swap_range_alloc+0x81/0x130
swap_range_alloc at mm/swapfile.c:681
scan_swap_map_slots+0x371/0xb90
get_swap_pages+0x39d/0x5c0
get_swap_page+0xf2/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1795/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
read to 0xffff8d5abccdc4d4 of 4 bytes by task 6672 on cpu 70:
scan_swap_map_slots+0x4a6/0xb90
scan_swap_map_slots at mm/swapfile.c:892
get_swap_pages+0x39d/0x5c0
get_swap_page+0xf2/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1795/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
Reported by Kernel Concurrency Sanitizer on:
CPU: 70 PID: 6672 Comm: oom01 Tainted: G W L 5.5.0-next-20200205+ #3
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
=== si.swap_map[offset] ===
write to 0xffffbc370c29a64c of 1 bytes by task 6856 on cpu 86:
__swap_entry_free_locked+0x8c/0x100
__swap_entry_free_locked at mm/swapfile.c:1209 (discriminator 4)
__swap_entry_free.constprop.20+0x69/0xb0
free_swap_and_cache+0x53/0xa0
unmap_page_range+0x7f8/0x1d70
unmap_single_vma+0xcd/0x170
unmap_vmas+0x18b/0x220
exit_mmap+0xee/0x220
mmput+0x10e/0x270
do_exit+0x59b/0xf40
do_group_exit+0x8b/0x180
read to 0xffffbc370c29a64c of 1 bytes by task 6855 on cpu 20:
_swap_info_get+0x81/0xa0
_swap_info_get at mm/swapfile.c:1140
free_swap_and_cache+0x40/0xa0
unmap_page_range+0x7f8/0x1d70
unmap_single_vma+0xcd/0x170
unmap_vmas+0x18b/0x220
exit_mmap+0xee/0x220
mmput+0x10e/0x270
do_exit+0x59b/0xf40
do_group_exit+0x8b/0x180
=== si.flags ===
write to 0xffff956c8fc6c400 of 8 bytes by task 6087 on cpu 23:
scan_swap_map_slots+0x6fe/0xb50
scan_swap_map_slots at mm/swapfile.c:887
get_swap_pages+0x39d/0x5c0
get_swap_page+0x377/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1795/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
read to 0xffff956c8fc6c400 of 8 bytes by task 6207 on cpu 63:
_swap_info_get+0x41/0xa0
__swap_info_get at mm/swapfile.c:1114
put_swap_page+0x84/0x490
__remove_mapping+0x384/0x5f0
shrink_page_list+0xff1/0x2870
shrink_inactive_list+0x316/0x880
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
The writes are under si->lock but the reads are not. For si.highest_bit
and si.swap_map[offset], a data race could trigger logic bugs, so fix them
by using WRITE_ONCE() for the writes and READ_ONCE() for the reads. The
exceptions are isolated reads that only compare against zero, where a data
race would cause no harm; annotate those as intentional data races using
the data_race() macro.
For si.flags, the readers are only interested in a single bit, so a data
race there would cause no issue.
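The three cases above (marked write, marked read, annotated single-bit
read) can be sketched in userspace as below. The macros are simplified
stand-ins relying on the GCC/Clang __typeof__ extension; FLAG_USED_SKETCH,
struct swap_info_sketch and the helper names are hypothetical.

```c
#include <stdbool.h>

/* Simplified stand-ins for the kernel macros. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))
#define data_race(expr)		(expr)

#define FLAG_USED_SKETCH	0x1UL	/* hypothetical flag bit */

struct swap_info_sketch {
	unsigned long flags;
	unsigned int highest_bit;
};

/* Writer, assumed to hold the lock: marked store so readers never see a torn value. */
static void lower_highest_bit(struct swap_info_sketch *si)
{
	WRITE_ONCE(si->highest_bit, si->highest_bit - 1);
}

/* Lockless reader whose result drives logic: marked load. */
static bool offset_in_range(struct swap_info_sketch *si, unsigned int offset)
{
	return offset <= READ_ONCE(si->highest_bit);
}

/* Lockless reader that only tests one bit: annotated as an intentional race. */
static bool is_used(struct swap_info_sketch *si)
{
	return data_race(si->flags & FLAG_USED_SKETCH) != 0;
}
```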
struct file_ra_state ra.mmap_miss could be accessed concurrently during
page faults as noticed by KCSAN,
BUG: KCSAN: data-race in filemap_fault / filemap_map_pages
write to 0xffff9b1700a2c1b4 of 4 bytes by task 3292 on cpu 30:
filemap_fault+0x920/0xfc0
do_sync_mmap_readahead at mm/filemap.c:2384
(inlined by) filemap_fault at mm/filemap.c:2486
__xfs_filemap_fault+0x112/0x3e0 [xfs]
xfs_filemap_fault+0x74/0x90 [xfs]
__do_fault+0x9e/0x220
do_fault+0x4a0/0x920
__handle_mm_fault+0xc69/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9b1700a2c1b4 of 4 bytes by task 3313 on cpu 32:
filemap_map_pages+0xc2e/0xd80
filemap_map_pages at mm/filemap.c:2625
do_fault+0x3da/0x920
__handle_mm_fault+0xc69/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 32 PID: 3313 Comm: systemd-udevd Tainted: G W L 5.5.0-next-20200210+ #1
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
ra.mmap_miss contributes to readahead decisions, so a data race could be
undesirable. Both the read and the write are only under the non-exclusive
mmap_sem, so two concurrent writers could even underflow the counter. Fix
the underflow by working on a local variable before committing a final
store to ra.mmap_miss, given that a small inaccuracy of the counter should
be acceptable.
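The local-variable approach can be sketched in userspace as follows. The
macros are simplified stand-ins (GCC/Clang __typeof__), and the cap value,
struct file_ra_sketch, record_miss() and record_hits() are hypothetical.
The point is that all arithmetic happens on a private copy, so the shared
counter is never decremented below zero, and only one final store is made.

```c
/* Simplified stand-ins for the kernel macros. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

#define MMAP_MISS_CAP_SKETCH	1000u	/* hypothetical upper bound */

struct file_ra_sketch {
	unsigned int mmap_miss;
};

/* Record one miss: snapshot, bump the local copy, store once. */
static void record_miss(struct file_ra_sketch *ra)
{
	unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

	if (mmap_miss < MMAP_MISS_CAP_SKETCH)
		WRITE_ONCE(ra->mmap_miss, mmap_miss + 1);
}

/* Record several hits: decrement the local copy, never below zero. */
static void record_hits(struct file_ra_sketch *ra, unsigned int hits)
{
	unsigned int mmap_miss = READ_ONCE(ra->mmap_miss);

	while (hits-- && mmap_miss > 0)
		mmap_miss--;
	WRITE_ONCE(ra->mmap_miss, mmap_miss);
}
```

A concurrent update between the snapshot and the store can still be lost,
which matches the stated tolerance for a small inaccuracy.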
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Tested-by: Qian Cai <cai@lca.pw> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/20200211030134.1847-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 15 Aug 2020 00:31:24 +0000 (17:31 -0700)]
mm/swap_state: mark various intentional data races
swap_cache_info.* could be accessed concurrently as noticed by
KCSAN,
BUG: KCSAN: data-race in lookup_swap_cache / lookup_swap_cache
write to 0xffffffff85517318 of 8 bytes by task 94138 on cpu 101:
lookup_swap_cache+0x12e/0x460
lookup_swap_cache at mm/swap_state.c:322
do_swap_page+0x112/0xeb0
__handle_mm_fault+0xc7a/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffffffff85517318 of 8 bytes by task 91655 on cpu 100:
lookup_swap_cache+0x117/0x460
lookup_swap_cache at mm/swap_state.c:322
shmem_swapin_page+0xc7/0x9e0
shmem_getpage_gfp+0x2ca/0x16c0
shmem_fault+0xef/0x3c0
__do_fault+0x9e/0x220
do_fault+0x4a0/0x920
__handle_mm_fault+0xc69/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 100 PID: 91655 Comm: systemd-journal Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
write to 0xffffffff8d717308 of 8 bytes by task 11365 on cpu 87:
__delete_from_swap_cache+0x681/0x8b0
__delete_from_swap_cache at mm/swap_state.c:178
read to 0xffffffff8d717308 of 8 bytes by task 11275 on cpu 53:
__delete_from_swap_cache+0x66e/0x8b0
__delete_from_swap_cache at mm/swap_state.c:178
Both the read and the write are lockless. Since swap_cache_info.* are only
used to print out counter information, it is harmless even if any of them
misses a few increments due to data races, so just mark them as
intentional data races using the data_race() macro.
While at it, fix a checkpatch.pl warning,
WARNING: Single statement macros should not use a do {} while (0) loop
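A sketch of what single-statement counter macros could look like once the
do {} while (0) wrapper is dropped and the update is annotated. This is a
userspace model with a pass-through data_race() stand-in; the struct
layout and the cache_info variable are hypothetical.

```c
/* Simplified stand-in for the kernel macro. */
#define data_race(expr) (expr)

struct swap_cache_info_sketch {
	unsigned long add_total;
	unsigned long find_total;
};

static struct swap_cache_info_sketch cache_info;

/*
 * The form checkpatch warns about wraps a single statement in a loop:
 *   #define INC_CACHE_INFO(x) do { cache_info.x++; } while (0)
 * A single expression needs no such wrapper.
 */
#define INC_CACHE_INFO(x)	data_race(cache_info.x++)
#define ADD_CACHE_INFO(x, nr)	data_race(cache_info.x += (nr))
```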
Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Link: http://lkml.kernel.org/r/20200207003715.1578-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Qian Cai [Sat, 15 Aug 2020 00:31:20 +0000 (17:31 -0700)]
mm/page_io: mark various intentional data races
struct swap_info_struct si.flags could be accessed concurrently as noticed
by KCSAN,
BUG: KCSAN: data-race in scan_swap_map_slots / swap_readpage
write to 0xffff9c77b80ac400 of 8 bytes by task 91325 on cpu 16:
scan_swap_map_slots+0x6fe/0xb50
scan_swap_map_slots at mm/swapfile.c:887
get_swap_pages+0x39d/0x5c0
get_swap_page+0x377/0x524
add_to_swap+0xe4/0x1c0
shrink_page_list+0x1740/0x2820
shrink_inactive_list+0x316/0x8b0
shrink_lruvec+0x8dc/0x1380
shrink_node+0x317/0xd80
do_try_to_free_pages+0x1f7/0xa10
try_to_free_pages+0x26c/0x5e0
__alloc_pages_slowpath+0x458/0x1290
__alloc_pages_nodemask+0x3bb/0x450
alloc_pages_vma+0x8a/0x2c0
do_anonymous_page+0x170/0x700
__handle_mm_fault+0xc9f/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffff9c77b80ac400 of 8 bytes by task 5422 on cpu 7:
swap_readpage+0x204/0x6a0
swap_readpage at mm/page_io.c:380
read_swap_cache_async+0xa2/0xb0
swapin_readahead+0x6a0/0x890
do_swap_page+0x465/0xeb0
__handle_mm_fault+0xc7a/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
Reported by Kernel Concurrency Sanitizer on:
CPU: 7 PID: 5422 Comm: gmain Tainted: G W O L 5.5.0-next-20200204+ #6
Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
Other reads,
read to 0xffff91ea33eac400 of 8 bytes by task 11276 on cpu 120:
__swap_writepage+0x140/0xc20
__swap_writepage at mm/page_io.c:289
read to 0xffff91ea33eac400 of 8 bytes by task 11264 on cpu 16:
swap_set_page_dirty+0x44/0x1f4
swap_set_page_dirty at mm/page_io.c:442
The write is under &si->lock, but the reads are done as lockless. Since
the reads only check for a specific bit in the flag, it is harmless even
if load tearing happens. Thus, just mark them as intentional data races
using the data_race() macro.
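The claim that a single-bit check tolerates load tearing can be modeled in
userspace. The sketch assumes tearing splits a 64-bit load into two 32-bit
halves taken from the old and the new value; that split, and all names
below, are illustrative assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Model a torn 64-bit load: low half from the new value, high half from the old. */
static uint64_t torn_load(uint64_t old_val, uint64_t new_val)
{
	const uint64_t low = 0xffffffffULL;

	return (new_val & low) | (old_val & ~low);
}

static bool test_bit_sketch(uint64_t flags, unsigned int bit)
{
	return (flags >> bit) & 1;
}

/* True if every individual bit of the torn value matches the old or the new value. */
static bool single_bit_reads_are_consistent(uint64_t old_val, uint64_t new_val)
{
	uint64_t torn = torn_load(old_val, new_val);

	for (unsigned int bit = 0; bit < 64; bit++) {
		bool t = test_bit_sketch(torn, bit);

		if (t != test_bit_sketch(old_val, bit) &&
		    t != test_bit_sketch(new_val, bit))
			return false;
	}
	return true;
}
```

In this model the torn word as a whole may match neither value, yet any
one bit always reflects either the state before or after the write.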
Qian Cai [Sat, 15 Aug 2020 00:31:14 +0000 (17:31 -0700)]
mm/kmemleak: silence KCSAN splats in checksum
Even if KCSAN is disabled for kmemleak, update_checksum() could still call
crc32() (which is outside of kmemleak.c) to dereference object->pointer.
Thus, the value of object->pointer could be accessed concurrently as
noticed by KCSAN,
BUG: KCSAN: data-race in crc32_le_base / do_raw_spin_lock
write to 0xffffb0ea683a7d50 of 4 bytes by task 23575 on cpu 12:
do_raw_spin_lock+0x114/0x200
debug_spin_lock_after at kernel/locking/spinlock_debug.c:91
(inlined by) do_raw_spin_lock at kernel/locking/spinlock_debug.c:115
_raw_spin_lock+0x40/0x50
__handle_mm_fault+0xa9e/0xd00
handle_mm_fault+0xfc/0x2f0
do_page_fault+0x263/0x6f9
page_fault+0x34/0x40
read to 0xffffb0ea683a7d50 of 4 bytes by task 839 on cpu 60:
crc32_le_base+0x67/0x350
crc32_le_base+0x67/0x350:
crc32_body at lib/crc32.c:106
(inlined by) crc32_le_generic at lib/crc32.c:179
(inlined by) crc32_le at lib/crc32.c:197
kmemleak_scan+0x528/0xd90
update_checksum at mm/kmemleak.c:1172
(inlined by) kmemleak_scan at mm/kmemleak.c:1497
kmemleak_scan_thread+0xcc/0xfa
kthread+0x1e0/0x200
ret_from_fork+0x27/0x50
If a shattered value was returned due to a data race, it will be corrected
in the next scan. Thus, let KCSAN ignore all reads in the region to
silence KCSAN in case the write side is non-atomic.
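The idea of ignoring all reads within a region can be sketched in
userspace as below. kcsan_disable_current() and kcsan_enable_current() are
modeled here as counting stubs, simple_sum() is a hypothetical replacement
for the CRC, and struct object_sketch is an illustrative stand-in.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Counting stubs standing in for the per-task KCSAN disable/enable calls. */
static int kcsan_disable_depth;
static void kcsan_disable_current(void) { kcsan_disable_depth++; }
static void kcsan_enable_current(void)  { kcsan_disable_depth--; }

struct object_sketch {
	const unsigned char *pointer;
	size_t size;
	uint32_t checksum;
};

/* Hypothetical checksum; the real code uses a CRC outside the annotated file. */
static uint32_t simple_sum(const unsigned char *p, size_t n)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < n; i++)
		sum = sum * 31 + p[i];
	return sum;
}

/* Returns true if the object's content changed since the previous scan. */
static bool update_checksum(struct object_sketch *obj)
{
	uint32_t old = obj->checksum;

	kcsan_disable_current();	/* reads below are not checked for races */
	obj->checksum = simple_sum(obj->pointer, obj->size);
	kcsan_enable_current();
	return obj->checksum != old;
}
```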
Suggested-by: Marco Elver <elver@google.com> Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Marco Elver <elver@google.com> Acked-by: Catalin Marinas <catalin.marinas@arm.com> Link: http://lkml.kernel.org/r/20200317182754.2180-1-cai@lca.pw Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Xiaoming Ni [Sat, 15 Aug 2020 00:31:07 +0000 (17:31 -0700)]
all arch: remove system call sys_sysctl
Since commit 5b728d4fe46298 ("sysctl: Remove the sysctl system call"),
sys_sysctl is actually unavailable: any input can only return an error.
We have been warning people about using the sysctl system call for years
and believe there are no more users. Even if there are users of this
interface, if they have not complained or fixed their code by now they
probably are not going to, so there is no point in warning them any
longer.
So completely remove sys_sysctl on all architectures.
[nixiaoming@huawei.com: s390: fix build error for sys_call_table_emu] Link: http://lkml.kernel.org/r/20200618141426.16884-1-nixiaoming@huawei.com Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Will Deacon <will@kernel.org> [arm/arm64] Acked-by: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnaldo Carvalho de Melo <acme@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Bin Meng <bin.meng@windriver.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Gerst <brgerst@gmail.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: chenzefeng <chenzefeng2@huawei.com> Cc: Christian Borntraeger <borntraeger@de.ibm.com> Cc: Christian Brauner <christian@brauner.io> Cc: Chris Zankel <chris@zankel.net> Cc: David Howells <dhowells@redhat.com> Cc: David S. Miller <davem@davemloft.net> Cc: Diego Elio Pettenò <flameeyes@flameeyes.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Dominik Brodowski <linux@dominikbrodowski.net> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Geert Uytterhoeven <geert@linux-m68k.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Helge Deller <deller@gmx.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Kars de Jong <jongk@linux-m68k.org> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Marco Elver <elver@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin K. Petersen <martin.petersen@oracle.com> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Cc: Matt Turner <mattst88@gmail.com> Cc: Max Filippov <jcmvbkbc@gmail.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Simek <monstr@monstr.eu> Cc: Miklos Szeredi <mszeredi@redhat.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Naveen N. Rao <naveen.n.rao@linux.vnet.ibm.com> Cc: Nick Piggin <npiggin@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Olof Johansson <olof@lixom.net> Cc: Paul Burton <paulburton@kernel.org> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Ravi Bangoria <ravi.bangoria@linux.ibm.com> Cc: Richard Henderson <rth@twiddle.net> Cc: Rich Felker <dalias@libc.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sargun Dhillon <sargun@sargun.me> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Sudeep Holla <sudeep.holla@arm.com> Cc: Sven Schnelle <svens@stackframe.org> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tony Luck <tony.luck@intel.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yoshinori Sato <ysato@users.sourceforge.jp> Cc: Zhou Yanjie <zhouyanjie@wanyeetech.com> Link: http://lkml.kernel.org/r/20200616030734.87257-1-nixiaoming@huawei.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Randy Dunlap [Sat, 15 Aug 2020 00:30:46 +0000 (17:30 -0700)]
fs: autofs: delete repeated words in comments
Drop duplicated words {the, at} in comments.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Ian Kent <raven@themaw.net> Link: http://lkml.kernel.org/r/20200811021817.24982-1-rdunlap@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mirroring offset_in_page(), this gives you the offset within this
particular page, no matter what size page it is. It optimises down to
offset_in_page() if CONFIG_TRANSPARENT_HUGEPAGE is not set.
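The relationship described above can be sketched in userspace by masking
an address with the size of the page it falls in. The 4 KiB base page
size and the function names are assumptions made for illustration; with
order 0 the sized variant reduces to the base-page variant.

```c
/* Assumed base page size for this sketch. */
#define PAGE_SIZE_SKETCH (1UL << 12)

/* Offset within a base page. */
static unsigned long offset_in_base_page(unsigned long addr)
{
	return addr & (PAGE_SIZE_SKETCH - 1);
}

/* Offset within a page of 2^order base pages, whatever that order is. */
static unsigned long offset_in_sized_page(unsigned long addr, unsigned int order)
{
	unsigned long size = PAGE_SIZE_SKETCH << order;

	return addr & (size - 1);
}
```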
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-8-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This is like compound_head() but compiles away when
CONFIG_TRANSPARENT_HUGEPAGE is not enabled.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-7-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.
[akpm@linux-foundation.org: fix mm/migrate.c]
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function returns the number of bytes in a THP. It is like
page_size(), but compiles to just PAGE_SIZE if CONFIG_TRANSPARENT_HUGEPAGE
is disabled.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-5-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This function returns the order of a transparent huge page. It compiles
to 0 if CONFIG_TRANSPARENT_HUGEPAGE is disabled.
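How such a helper can compile down to a constant is sketched below in
userspace, together with a byte-size helper built on top of it. The
CONFIG_THP_SKETCH switch, the 4 KiB page size, struct page_sketch and the
function names are all hypothetical.

```c
/* Assumed base page size for this sketch. */
#define PAGE_SIZE_SKETCH 4096UL

struct page_sketch {
	unsigned int order;	/* order recorded for a huge page */
};

#ifdef CONFIG_THP_SKETCH
/* Huge pages enabled: report the order stored with the page. */
static unsigned int thp_order_sketch(const struct page_sketch *page)
{
	return page->order;
}
#else
/* Huge pages disabled: a constant the compiler can fold away. */
static unsigned int thp_order_sketch(const struct page_sketch *page)
{
	(void)page;
	return 0;
}
#endif

/* Bytes covered by the page; just PAGE_SIZE_SKETCH when the order is constant 0. */
static unsigned long thp_size_sketch(const struct page_sketch *page)
{
	return PAGE_SIZE_SKETCH << thp_order_sketch(page);
}
```

Built without CONFIG_THP_SKETCH defined, both helpers ignore the stored
order, so callers pay nothing for the huge page case.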
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: William Kucharski <william.kucharski@oracle.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: http://lkml.kernel.org/r/20200629151959.15779-4-willy@infradead.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Give up on the notion that we can remove page-flags.h from mm.h. There
are currently 14 inline functions which use a PageFoo function. Also, two
of the files directly included by mm.h include page-flags.h themselves,
and there are probably more indirect inclusions. So just include it at
the top like any other header file.
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-3-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
These are some generic cleanups and improvements, which I would like
merged into mmotm soon. The first one should be a performance improvement
for all users of compound pages, and the others are aimed at getting code
to compile away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (i.e. small
systems). The result is also better documented and less confusing than
the current prefix mixture of compound, hpage and thp.
This patch (of 7):
This removes a few instructions from functions which need to know how many
pages are in a compound page. The storage used is either page->mapping on
64-bit or page->index on 32-bit. Both of these are fine to overlay on
tail pages.
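As a rough illustration of the idea, the page count can be stored once when the compound page is set up and then read back, instead of being recomputed from the order on every query. This is a userspace sketch with hypothetical names (struct sketch_page and the sketch_ helpers); it does not model the overlay on page->mapping or page->index mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page. */
struct sketch_page {
	bool head;                    /* set on the first page of a compound page */
	unsigned char compound_order; /* stored in the first tail page */
	unsigned int compound_nr;     /* cached 1 << order, also in the first tail page */
};

/* Record both the order and the precomputed page count in the first tail page. */
static void sketch_set_compound_order(struct sketch_page *page, unsigned int order)
{
	assert(page != NULL && order > 0 && order < 32);
	page[0].head = true;
	page[1].compound_order = (unsigned char)order;
	page[1].compound_nr = 1U << order;
}

/* Readers load the cached count; no shift is needed at lookup time. */
static unsigned long sketch_compound_nr(const struct sketch_page *page)
{
	assert(page != NULL);
	if (!page->head)
		return 1;
	return page[1].compound_nr;
}
```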
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org
Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Greg Kurz [Sat, 15 Aug 2020 00:30:20 +0000 (17:30 -0700)]
mailmap: add entry for Greg Kurz
I had stopped using gkurz@linux.vnet.ibm.com a while back already but this
email address was shut down last June when I quit IBM. It's about time to
map it to groug@kaod.org.
Kees Cook [Sat, 15 Aug 2020 00:30:14 +0000 (17:30 -0700)]
exec: restore EACCES of S_ISDIR execve()
Patch series "Fix S_ISDIR execve() errno".
Fix an errno change for execve() of directories, noticed by Marc Zyngier.
Along with the fix, include a regression test to avoid seeing this return
in the future.
This patch (of 2):
The return code for attempting to execute a directory has always been
EACCES. Adjust the S_ISDIR exec test to reflect the old errno instead of
the general EISDIR for other kinds of "open" attempts on directories.
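A userspace check of the behaviour described above might look like the following sketch. The helper name exec_errno is hypothetical and this is not the regression test from the series; it only shows how the errno can be observed. It should only be called on paths that are expected to fail, because a successful execve() replaces the calling process.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Attempt to execve() the given path and report the resulting errno.
 * Only meaningful for paths that cannot be executed (e.g. a directory). */
static int exec_errno(const char *path)
{
	char *const argv[] = { (char *)path, NULL };
	char *const envp[] = { NULL };

	assert(path != NULL);
	execve(path, argv, envp);   /* returns only on failure */
	return errno;
}
```

On a kernel with this fix applied, exec_errno("/") is expected to yield EACCES rather than EISDIR.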
Nick Terrell [Sat, 15 Aug 2020 00:30:10 +0000 (17:30 -0700)]
lz4: fix kernel decompression speed
This patch replaces all memcpy() calls with LZ4_memcpy() which calls
__builtin_memcpy() so the compiler can inline it.
LZ4 relies heavily on memcpy() with a constant size being inlined. In x86
and i386 pre-boot environments memcpy() cannot be inlined because memcpy()
doesn't get defined as __builtin_memcpy().
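One way to define such a wrapper is a macro that maps straight onto the compiler builtin, so that constant-size copies can be expanded inline even where memcpy() itself is an out-of-line function. The sketch below assumes a GCC- or Clang-compatible compiler; sketch_copy8 is a hypothetical caller added only to show a constant-size use.

```c
#include <assert.h>
#include <stddef.h>

/* Route copies to the builtin; with a constant size the compiler can
 * replace the call with a few load/store instructions. */
#define LZ4_memcpy(dst, src, size) __builtin_memcpy(dst, src, size)

/* Hypothetical helper: copy exactly 8 bytes, a size known at compile time. */
static void sketch_copy8(void *dst, const void *src)
{
	assert(dst != NULL && src != NULL);
	LZ4_memcpy(dst, src, 8);
}
```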
An equivalent patch has been applied upstream so that the next import
won't lose this change [1].
I've measured the kernel decompression speed using QEMU before and after
this patch for the x86_64 and i386 architectures. The speed-up is about
10x as shown below.
Code   Arch    Kernel Size   Time    Speed
v5.8   x86_64  11504832 B    148 ms   79 MB/s
patch  x86_64  11503872 B     13 ms  885 MB/s
v5.8   i386     9621216 B     91 ms  106 MB/s
patch  i386     9620224 B     10 ms  962 MB/s
I also measured the time to decompress the initramfs on x86_64, i386, and
arm. All three show the same decompression speed before and after, as
expected.
Sonny reported that one of their tests started failing on the latest
kernel on their Chrome OS platform. The root cause is that the above
commit removed the protection line of empty zones, while the parser used in
the test relies on the protection line to mark the end of each zone.
Let's revert it to avoid breaking userspace testing or applications.
Fixes: 9b9baaf8384add ("mm/vmstat.c: do not show lowmem reserve protection information of empty zone")
Reported-by: Sonny Rao <sonnyrao@chromium.org>
Signed-off-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Cc: <stable@vger.kernel.org> [5.8.x]
Link: http://lkml.kernel.org/r/20200811075412.12872-1-bhe@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 14 Aug 2020 21:26:08 +0000 (14:26 -0700)]
Merge tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timekeeping updates from Thomas Gleixner:
"A set of timekeeping/VDSO updates:
- Preparatory work to allow S390 to switch over to the generic VDSO
implementation.
S390 requires that the VDSO data pointer is handed in to the
counter read function when time namespace support is enabled.
Adding the pointer is a NOOP for all other architectures because
the compiler is supposed to optimize that out when it is unused in
the architecture specific inline. The change also solved a similar
problem for MIPS which fortunately does not have time namespaces
enabled yet.
S390 needs to update clock related VDSO data independent of the
timekeeping updates. This was solved so far with yet another
sequence counter in the S390 implementation. A better solution is
to utilize the already existing VDSO sequence count for this. The
core code now exposes helper functions which allow serializing
against the timekeeper code and against concurrent readers.
S390 needs extra data for their clock readout function. The initial
common VDSO data structure did not provide a way to add that. It
now has an embedded architecture specific struct which
defaults to an empty struct.
Doing this now avoids tree dependencies and conflicts post rc1 and
allows all other architectures which work on generic VDSO support
to work from a common upstream base.
- A trivial comment fix"
* tag 'timers-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
time: Delete repeated words in comments
lib/vdso: Allow to add architecture-specific vdso data
timekeeping/vsyscall: Provide vdso_update_begin/end()
vdso/treewide: Add vdso_data pointer argument to __arch_get_hw_counter()
Linus Torvalds [Fri, 14 Aug 2020 21:17:51 +0000 (14:17 -0700)]
Merge tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull more timer updates from Thomas Gleixner:
"A set of posix CPU timer changes which allows to defer the heavy work
of posix CPU timers into task work context. The tick interrupt is
reduced to a quick check which queues the work which is doing the
heavy lifting before returning to user space or going back to guest
mode. Moving this out is deferring the signal delivery slightly but
posix CPU timers are inaccurate by nature as they depend on the tick
so there is no real damage. The relevant test cases all passed.
This lifts the last offender for RT out of the hard interrupt context
tick handler, but it also has the general benefit that the actual
heavy work is accounted to the task/process and not to the tick
interrupt itself.
Further optimizations are possible to break long sighand lock hold and
interrupt disabled (on !RT kernels) times when a massive amount of
posix CPU timers (which are unprivileged) is armed for a
task/process.
This is currently only enabled for x86 because the architecture has to
ensure that task work is handled in KVM before entering a guest, which
was just established for x86 with the new common entry/exit code which
got merged post 5.8 and is not the case for other KVM architectures"
* tag 'timers-core-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86: Select POSIX_CPU_TIMERS_TASK_WORK
posix-cpu-timers: Provide mechanisms to defer timer handling to task_work
posix-cpu-timers: Split run_posix_cpu_timers()
Linus Torvalds [Fri, 14 Aug 2020 21:14:28 +0000 (14:14 -0700)]
Merge tag 'irq-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq fixes from Thomas Gleixner:
"Two fixes in the core interrupt code which ensure that all error exits
unlock the descriptor lock"
* tag 'irq-urgent-2020-08-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
genirq: Unlock irq descriptor after errors
genirq/PM: Always unlock IRQ descriptor in rearm_wake_irq()
Linus Torvalds [Fri, 14 Aug 2020 21:04:53 +0000 (14:04 -0700)]
Merge tag 'for-linus' of git://github.com/openrisc/linux
Pull OpenRISC updates from Stafford Horne:
"A few patches all over the place during this cycle, mostly bug and
sparse warning fixes for OpenRISC, but a few enhancements too. Note,
there are 2 non OpenRISC specific fixups.
Non OpenRISC fixes:
- In init we need to align the init_task correctly to fix an issue
with MUTEX_FLAGS, reviewed by Peter Z. No one picked this up so I
kept it on my tree.
- In asm-generic/io.h I fixed up some sparse warnings, OK'd by Arnd.
Arnd asked to merge it via my tree.
OpenRISC fixes:
- Many fixes for OpenRISC sparse warnings.
- Add support for OpenRISC SMP TLB flushing rather than always flushing
the entire TLB on every CPU.
- Fix bug when dumping stack via /proc/xxx/stack of user threads"
* tag 'for-linus' of git://github.com/openrisc/linux:
openrisc: uaccess: Add user address space check to access_ok
openrisc: signal: Fix sparse address space warnings
openrisc: uaccess: Remove unused macro __addr_ok
openrisc: uaccess: Use static inline function in access_ok
openrisc: uaccess: Fix sparse address space warnings
openrisc: io: Fixup defines and move include to the end
asm-generic/io.h: Fix sparse warnings on big-endian architectures
openrisc: Implement proper SMP tlb flushing
openrisc: Fix oops caused when dumping stack
openrisc: Add support for external initrd images
init: Align init_task to avoid conflict with MUTEX_FLAGS
openrisc: fix __user in raw_copy_to_user()'s prototype
Linus Torvalds [Fri, 14 Aug 2020 20:40:27 +0000 (13:40 -0700)]
Merge tag 'powerpc-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One fix for a boot crash on some platforms introduced by the recent
pkey refactoring.
Thanks to Christian Zigotzky and Aneesh Kumar K.V"
* tag 'powerpc-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/pkeys: Fix boot failures with Nemo board (A-EON AmigaOne X1000)
Linus Torvalds [Fri, 14 Aug 2020 20:34:37 +0000 (13:34 -0700)]
Merge tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull more xen updates from Juergen Gross:
- Remove support for running as 32-bit Xen PV-guest.
32-bit PV guests are rarely used, are lacking security fixes for
Meltdown, and can be easily replaced by PVH mode. Another series for
doing more cleanup will follow soon (removal of 32-bit-only pvops
functionality).
- Fixes and additional features for the Xen display frontend driver.
* tag 'for-linus-5.9-rc1b-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
drm/xen-front: Pass dumb buffer data offset to the backend
xen: Sync up with the canonical protocol definition in Xen
drm/xen-front: Add YUYV to supported formats
drm/xen-front: Fix misused IS_ERR_OR_NULL checks
xen/gntdev: Fix dmabuf import with non-zero sgt offset
x86/xen: drop tests for highmem in pv code
x86/xen: eliminate xen-asm_64.S
x86/xen: remove 32-bit Xen PV guest support
Linus Torvalds [Fri, 14 Aug 2020 20:31:25 +0000 (13:31 -0700)]
Merge tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux
Pull hyper-v fixes from Wei Liu:
- fix oops reporting on Hyper-V
- make objtool happy
* tag 'hyperv-fixes-signed' of git://git.kernel.org/pub/scm/linux/kernel/git/hyperv/linux:
x86/hyperv: Make hv_setup_sched_clock inline
Drivers: hv: vmbus: Only notify Hyper-V for die events that are oops
This can happen if ptrace() or sigreturn() pokes an LDT selector into FS
or GS for a task with no LDT and something tries to read the base before
a return to usermode notices the bad selector and fixes it.
The fix is to make sure ldt pointer is not NULL.
Fixes: 66cee9012dd8 ("x86/fsgsbase/64: Fix ptrace() to read the FS/GS base accurately")
Co-developed-by: Jann Horn <jannh@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Andy Lutomirski <luto@kernel.org>
Cc: Chang S. Bae <chang.seok.bae@intel.com>
Cc: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Markus T Metzger <markus.t.metzger@intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ravi Shankar <ravi.v.shankar@intel.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 14 Aug 2020 18:07:02 +0000 (11:07 -0700)]
Merge tag 'modules-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux
Pull module updates from Jessica Yu:
"The most important change would be Christoph Hellwig's patch
implementing proprietary taint inheritance, in an effort to discourage
the creation of GPL "shim" modules that interface between GPL symbols
and proprietary symbols.
Summary:
- Have modules that use symbols from proprietary modules inherit the
TAINT_PROPRIETARY_MODULE taint, in an effort to prevent GPL shim
modules that are used to circumvent _GPL exports. These are modules
that claim to be GPL licensed while also using symbols from
proprietary modules. Such modules will be rejected while non-GPL
modules will inherit the proprietary taint.
- Module export space cleanup. Unexport symbols that are unused
outside of module.c or otherwise used in only built-in code"
* tag 'modules-for-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
modules: inherit TAINT_PROPRIETARY_MODULE
modules: return licensing information from find_symbol
modules: rename the licence field in struct symsearch to license
modules: unexport __module_address
modules: unexport __module_text_address
modules: mark each_symbol_section static
modules: mark find_symbol static
modules: mark ref_module static
modules: linux/moduleparam.h: drop duplicated word in a comment
Linus Torvalds [Fri, 14 Aug 2020 18:04:45 +0000 (11:04 -0700)]
Merge tag 'kconfig-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull Kconfig updates from Masahiro Yamada:
- remove '---help---' keyword support
- fix mouse events for 'menuconfig' symbols in search view of qconf
- code cleanups of qconf
* tag 'kconfig-v5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (24 commits)
kconfig: qconf: move setOptionMode() to ConfigList from ConfigView
kconfig: qconf: do not limit the pop-up menu to the first row
kconfig: qconf: refactor icon setups
kconfig: qconf: remove unused voidPix, menuInvPix
kconfig: qconf: remove ConfigItem::text/setText
kconfig: qconf: remove ConfigList::addColumn/removeColumn
kconfig: qconf: remove ConfigItem::pixmap/setPixmap
kconfig: qconf: drop more localization code
kconfig: qconf: remove 'parent' from ConfigList::updateMenuList()
kconfig: qconf: remove unused argument from ConfigView::updateList()
kconfig: qconf: remove unused argument from ConfigList::updateList()
kconfig: qconf: omit parent to QHBoxLayout()
kconfig: qconf: remove name from ConfigSearchWindow constructor
kconfig: qconf: remove unused ConfigList::listView()
kconfig: qconf: overload addToolBar() to create and insert toolbar
kconfig: qconf: remove toolBar from ConfigMainWindow members
kconfig: qconf: use 'menu' variable for (QMenu *)
kconfig: qconf: do not use 'menu' variable for (QMenuBar *)
kconfig: qconf: remove ->addSeparator() to menuBar
kconfig: add 'static' to some file-local data
...
Masahiro Yamada [Fri, 7 Aug 2020 09:19:08 +0000 (18:19 +0900)]
kconfig: qconf: do not limit the pop-up menu to the first row
If you right-click the first row in the option tree, the pop-up menu
shows up, but if you right-click the second row or below, the event
is ignored due to the following check:
if (e->y() <= header()->geometry().bottom()) {
Perhaps the intention was to show the pop-up menu only when the tree
header was right-clicked, but this handler is not called in that case.
Since the origin of e->y() starts from the bottom of the header,
this check is odd.
Going forward, you can right-click anywhere in the tree to get the
pop-up menu.
Masahiro Yamada [Fri, 7 Aug 2020 09:19:07 +0000 (18:19 +0900)]
kconfig: qconf: refactor icon setups
These icon data are used by ConfigItem, but stored in each instance
of ConfigView. There is no point in keeping the same data in each of 3
instances, "menu", "config", and "search".
Move the icon data to the more relevant ConfigItem class, and make
them static members.
Masahiro Yamada [Fri, 7 Aug 2020 09:18:55 +0000 (18:18 +0900)]
kconfig: qconf: overload addToolBar() to create and insert toolbar
Use the overloaded function, addToolBar(const QString &title), which
creates a QToolBar object, sets its window title, and inserts it into
the toolbar area.
Linus Torvalds [Fri, 14 Aug 2020 01:41:00 +0000 (18:41 -0700)]
Merge branch 'i2c/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c updates from Wolfram Sang:
- bus recovery can now be given a pinctrl handle and the I2C core will
do all the steps to switch to/from GPIO which can save quite some
boilerplate code from drivers
- "fallthrough" conversion
- driver updates, mostly ID additions
* 'i2c/for-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (32 commits)
i2c: iproc: fix race between client unreg and isr
i2c: eg20t: use generic power management
i2c: eg20t: Drop PCI wakeup calls from .suspend/.resume
i2c: mediatek: Fix i2c_spec_values description
i2c: mediatek: Add i2c compatible for MediaTek MT8192
dt-bindings: i2c: update bindings for MT8192 SoC
i2c: mediatek: Add access to more than 8GB dram in i2c driver
i2c: mediatek: Add apdma sync in i2c driver
i2c: i801: Add support for Intel Tiger Lake PCH-H
i2c: i801: Add support for Intel Emmitsburg PCH
i2c: bcm2835: Replace HTTP links with HTTPS ones
Documentation: i2c: dev: 'block process call' is supported
i2c: at91: Move to generic GPIO bus recovery
i2c: core: treat EPROBE_DEFER when acquiring SCL/SDA GPIOs
i2c: core: add generic I2C GPIO recovery
dt-bindings: i2c: add generic properties for GPIO bus recovery
i2c: rcar: avoid race when unregistering slave
i2c: tegra: Avoid tegra_i2c_init_dma() for Tegra210 vi i2c
i2c: tegra: Fix runtime resume to re-init VI I2C
i2c: tegra: Fix the error path in tegra_i2c_runtime_resume
...
Tonghao Zhang [Wed, 12 Aug 2020 09:56:39 +0000 (17:56 +0800)]
net: openvswitch: introduce common code for flushing flows
To avoid some issues, for example RCU usage warning and double free,
we should flush the flows under ovs_lock. This patch refactors
table_instance_destroy and introduces table_instance_flow_flush
which can be invoked by __dp_destroy or ovs_flow_tbl_flush.
Fixes: 67df6806f2fb ("net: openvswitch: fix possible memleak on destroy flow-table")
Reported-by: Johan Knöös <jknoos@google.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050489.html
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
John Ogness [Thu, 13 Aug 2020 19:39:25 +0000 (21:45 +0206)]
af_packet: TPACKET_V3: fix fill status rwlock imbalance
After @blk_fill_in_prog_lock is acquired there is an early out vnet
situation that can occur. In that case, the rwlock needs to be
released.
Also, since @blk_fill_in_prog_lock is only acquired when @tp_version
is exactly TPACKET_V3, only release it on that exact condition as
well.
And finally, add sparse annotation so that it is clearer that
prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring
and releasing @blk_fill_in_prog_lock, respectively. sparse is still
unable to understand the balance, but the warnings are now on a
higher level that make more sense.
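The balance rule described above (release on the early-out path, and only under the exact condition under which the lock was taken) can be sketched in userspace as follows. All names are hypothetical, and a plain counter stands in for the read side of the rwlock so the balance can be checked directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical version enum; only "V3" takes the fill-status lock. */
enum sketch_version { SKETCH_V1, SKETCH_V2, SKETCH_V3 };

struct sketch_ring {
	enum sketch_version version;
	int readers;   /* stand-in for the rwlock: number of read-side holders */
};

static void sketch_fill_begin(struct sketch_ring *r) { r->readers++; }

static void sketch_fill_end(struct sketch_ring *r)
{
	assert(r->readers > 0);   /* releasing an unheld lock is a bug */
	r->readers--;
}

/* Returns 0 on success, -1 when the early-out path is taken. */
static int sketch_receive(struct sketch_ring *r, bool early_out)
{
	assert(r != NULL);
	if (r->version == SKETCH_V3)
		sketch_fill_begin(r);

	if (early_out) {
		/* Release under the same exact condition used to acquire. */
		if (r->version == SKETCH_V3)
			sketch_fill_end(r);
		return -1;
	}

	/* ... packet data would be copied into the block here ... */

	if (r->version == SKETCH_V3)
		sketch_fill_end(r);
	return 0;
}
```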
Fixes: 39d698d2d507 ("af_packet: TPACKET_V3: replace busy-wait loop")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Eric Dumazet [Thu, 13 Aug 2020 17:06:43 +0000 (10:06 -0700)]
random32: add a tracepoint for prandom_u32()
There has been some heat around prandom_u32() lately, and some people
were wondering if there was a simple way to determine how often
it was used, before considering making it maybe 10 times more expensive.
This tracepoint exports the generated pseudo random value.
Tested:
perf list | grep prandom_u32
random:prandom_u32 [Tracepoint event]
perf record -a [-g] [-C1] -e random:prandom_u32 sleep 1
[ perf record: Woken up 0 times to write data ]
[ perf record: Captured and wrote 259.748 MB perf.data (924087 samples) ]
Linus Torvalds [Thu, 13 Aug 2020 20:57:45 +0000 (13:57 -0700)]
Merge tag 'docs-5.9-2' of git://git.lwn.net/linux
Pull documentation fixes from Jonathan Corbet:
"A handful of obvious fixes that wandered in during the merge window"
* tag 'docs-5.9-2' of git://git.lwn.net/linux:
Documentation/locking/locktypes: fix the typo
doc/zh_CN: resolve undefined label warning in admin-guide index
doc/zh_CN: fix title heading markup in admin-guide cpu-load
docs: remove the 2.6 "Upgrading I2C Drivers" guide
docs: Correct the release date of 5.2 stable
mailmap: Update comments for with format and more detalis
docs: cdrom: Fix a typo and rst markup
Doc: admin-guide: use correct legends in kernel-parameters.txt
Documentation/features: refresh RISC-V arch support files
documentation: coccinelle: Improve command example for make C={1,2}
Core-api: Documentation: Replace deprecated :c:func: Usage
Dev-tools: Documentation: Replace deprecated :c:func: Usage
Filesystems: Documentation: Replace deprecated :c:func: Usage
docs: trace: fix a typo
Linus Torvalds [Thu, 13 Aug 2020 19:38:32 +0000 (12:38 -0700)]
Merge tag 's390-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Heiko Carstens:
- Allow s390 debug feature to finally handle more than 256 CPU numbers,
instead of truncating the most significant bits.
- Improve THP splitting required by qemu processes by making use of
walk_page_vma() instead of calling follow_page() for every single
page within each vma.
- Add missing ZCRYPT dependency to VFIO_AP to fix potential compile
problems.
- Remove the unneeded select CLOCKSOURCE_VALIDATE_LAST_CYCLE again.
- Set node distance to LOCAL_DISTANCE instead of 0, since e.g. libnuma
translates a node distance of 0 to "no NUMA support available".
- Couple of other minor fixes and improvements.
* tag 's390-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
s390/numa: move code to arch/s390/kernel
s390/time: remove select CLOCKSOURCE_VALIDATE_LAST_CYCLE again
s390/debug: debug feature version 3
s390/Kconfig: add missing ZCRYPT dependency to VFIO_AP
s390/numa: set node distance to LOCAL_DISTANCE
s390/pkey: remove redundant variable initialization
s390/test_unwind: fix possible memleak in test_unwind()
s390/gmap: improve THP splitting
s390/atomic: circumvent gcc 10 build regression
Linus Torvalds [Thu, 13 Aug 2020 19:26:18 +0000 (12:26 -0700)]
Merge tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull more btrfs updates from David Sterba:
"One minor update, the rest are fixes that have arrived a bit late for
the first batch. There are also some recent fixes for bugs that were
discovered during the merge window and pop up during testing.
User visible change:
- show correct subvolume path in /proc/mounts for bind mounts
Fixes:
- fix compression messages when remounting with different level or
compression algorithm
- tree-log: fix some memory leaks on error handling paths
- restore I_VERSION on remount
- fix return values and error code mixups
- fix umount crash with quotas enabled when removing sysfs files
- fix trim range on a shrunk device"
* tag 'for-5.9-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
btrfs: trim: fix underflow in trim length to prevent access beyond device boundary
btrfs: fix return value mixup in btrfs_get_extent
btrfs: sysfs: fix NULL pointer dereference at btrfs_sysfs_del_qgroups()
btrfs: check correct variable after allocation in btrfs_backref_iter_alloc
btrfs: make sure SB_I_VERSION doesn't get unset by remount
btrfs: fix memory leaks after failure to lookup checksums during inode logging
btrfs: don't show full path of bind mounts in subvol=
btrfs: fix messages after changing compression level by remount
btrfs: only search for left_info if there is no right_info in try_merge_free_space
btrfs: inode: fix NULL pointer dereference if inode doesn't need compression
Linus Torvalds [Thu, 13 Aug 2020 19:22:19 +0000 (12:22 -0700)]
Merge tag 'xfs-5.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux
Pull xfs fixes from Darrick Wong:
"Two small fixes that have come in during the past week:
- Fix duplicated words in comments
- Fix a UBSAN complaint about null pointer arithmetic"
* tag 'xfs-5.9-merge-8' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: Fix UBSAN null-ptr-deref in xfs_sysfs_init
xfs: delete duplicated words + other fixes
Linus Torvalds [Thu, 13 Aug 2020 19:18:07 +0000 (12:18 -0700)]
Merge tag 'exfat-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat
Pull exfat updates from Namjae Jeon:
- don't clear MediaFailure and VolumeDirty bit in volume flags if these
were already set before mounting
- write multiple dirty buffers at once in sync mode
- remove unneeded EXFAT_SB_DIRTY bit set
* tag 'exfat-for-5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/linkinjeon/exfat:
exfat: retain 'VolumeFlags' properly
exfat: optimize exfat_zeroed_cluster()
exfat: add error check when updating dir-entries
exfat: write multiple sectors at once
exfat: remove EXFAT_SB_DIRTY flag
Johannes Weiner [Thu, 13 Aug 2020 14:40:54 +0000 (10:40 -0400)]
mm: memcontrol: fix warning when allocating the root cgroup
Commit 9209067f96cf ("mm: memcg: charge memcg percpu memory to the
parent cgroup") adds memory tracking to the memcg kernel structures
themselves to make cgroups liable for the memory they are consuming
through the allocation of child groups (which can be significant).
This code is a bit awkward as it's spread out through several functions:
The outermost function does memalloc_use_memcg(parent) to set up
current->active_memcg, which designates which cgroup to charge, and the
inner functions pass GFP_ACCOUNT to request charging for specific
allocations. To make sure this dependency is satisfied at all times -
to make sure we don't randomly charge whoever is calling the functions -
the inner functions warn on !current->active_memcg.
However, this triggers a false warning when the root memcg itself is
allocated. No parent exists in this case, and so current->active_memcg
is rightfully NULL. It's a false positive, not indicative of a bug.
Delete the warnings for now, we can revisit this later.
Fixes: 9209067f96cf ("mm: memcg: charge memcg percpu memory to the parent cgroup")
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Acked-by: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
drm/xen-front: Pass dumb buffer data offset to the backend
While importing a dmabuf it is possible that the data of the buffer
is placed at an offset, which is indicated by the SGT offset.
Respect the offset value and forward it to the backend.
xen: Sync up with the canonical protocol definition in Xen
This is the sync up with the canonical definition of the
display protocol in Xen.
1. Add protocol version as an integer
Version string, which is in fact an integer, is hard to handle in the
code that supports different protocol versions. To simplify that
also add the version as an integer.
2. Pass buffer offset with XENDISPL_OP_DBUF_CREATE
There are cases when display data buffer is created with non-zero
offset to the data start. Handle such cases and provide that offset
while creating a display buffer.
3. Add XENDISPL_OP_GET_EDID command
Add an optional request for reading Extended Display Identification
Data (EDID) structure which allows better configuration of the
display connectors over the configuration set in XenStore.
With this change connectors may have multiple resolutions defined
with respect to detailed timing definitions and additional properties
normally provided by displays.
If this request is not supported by the backend then visible area
is defined by the relevant XenStore's "resolution" property.
If backend provides extended display identification data (EDID) with
XENDISPL_OP_GET_EDID request then EDID values must take precedence
over the resolutions defined in XenStore.
xen/gntdev: Fix dmabuf import with non-zero sgt offset
It is possible that the scatter-gather table during dmabuf import has
non-zero offset of the data, but user-space doesn't expect that.
Fix this by failing the import, so user-space doesn't access wrong data.
Guenter Roeck [Tue, 11 Aug 2020 18:00:12 +0000 (11:00 -0700)]
genirq: Unlock irq descriptor after errors
In irq_set_irqchip_state(), the irq descriptor is not unlocked after an
error is encountered. While that should never happen in practice, a buggy
driver may trigger it. This would result in a lockup, so fix it.
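The usual shape of such a fix is to route every error exit through a single unlock label. The userspace sketch below shows that pattern only; struct sketch_desc, the helpers and the -ENODEV return value are illustrative assumptions, not the kernel's irq_set_irqchip_state() code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical descriptor with a flag standing in for the descriptor lock. */
struct sketch_desc {
	bool locked;
	bool has_chip;
	bool state;
};

static void sketch_lock(struct sketch_desc *d)   { assert(!d->locked); d->locked = true; }
static void sketch_unlock(struct sketch_desc *d) { assert(d->locked);  d->locked = false; }

/* Set the state; every exit, including the error one, drops the lock. */
static int sketch_set_state(struct sketch_desc *d, bool val)
{
	int err = -ENODEV;   /* illustrative error code */

	assert(d != NULL);
	sketch_lock(d);
	if (!d->has_chip)
		goto out_unlock;   /* error path must not return with the lock held */

	d->state = val;
	err = 0;

out_unlock:
	sketch_unlock(d);
	return err;
}
```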
Fixes: 9189bdaf2a72 ("genirq: Check irq_data_get_irq_chip() return value before use")
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200811180012.80269-1-linux@roeck-us.net
Ondrej Mosnacek [Wed, 12 Aug 2020 12:58:25 +0000 (14:58 +0200)]
crypto: algif_aead - fix uninitialized ctx->init
In skcipher_accept_parent_nokey() the whole af_alg_ctx structure is
cleared by memset() after allocation, so add such memset() also to
aead_accept_parent_nokey() so that the new "init" field is also
initialized to zero. Without that the initial ctx->init checks might
randomly return true and cause errors.
While there, also remove the redundant zero assignments in both
functions.
Found via libkcapi testsuite.
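The effect of clearing the whole context after allocation can be shown with a small userspace sketch: a newly added field starts out as zero without needing its own assignment. struct sketch_ctx and the helpers are hypothetical stand-ins, not the af_alg_ctx layout.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical context; "init" plays the role of the newly added field. */
struct sketch_ctx {
	unsigned int used;
	unsigned int more;
	unsigned int init;
};

/* Allocate and clear the whole structure, so no per-field zeroing is needed. */
static struct sketch_ctx *sketch_ctx_alloc(void)
{
	struct sketch_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	memset(ctx, 0, sizeof(*ctx));
	return ctx;
}

/* A check that would read garbage if the structure were left uninitialized. */
static int sketch_ctx_ready(const struct sketch_ctx *ctx)
{
	assert(ctx != NULL);
	return ctx->init != 0;
}
```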
Cc: Stephan Mueller <smueller@chronox.de>
Fixes: b7dad8488f83 ("crypto: algif_aead - Only wake up when ctx->more is zero")
Suggested-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Linus Torvalds [Thu, 13 Aug 2020 00:17:00 +0000 (17:17 -0700)]
Merge tag 'rtc-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux
Pull RTC updates from Alexandre Belloni:
"Not much this cycle - mostly non urgent driver fixes:
- ds1374: use watchdog core
- pcf2127: add alarm and pcf2129 support"
* tag 'rtc-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux:
rtc: pcf2127: fix alarm handling
rtc: pcf2127: add alarm support
rtc: pcf2127: add pca2129 device id
rtc: max77686: Fix wake-ups for max77620
rtc: ds1307: provide an indication that the watchdog has fired
rtc: ds1374: remove unused define
rtc: ds1374: fix RTC_DRV_DS1374_WDT dependencies
rtc: cleanup obsolete comment about struct rtc_class_ops
rtc: pl031: fix set_alarm by adding back call to alarm_irq_enable
rtc: ds1374: wdt: Use watchdog core for watchdog part
rtc: Replace HTTP links with HTTPS ones
rtc: goldfish: Enable interrupt in set_alarm() when necessary
rtc: max77686: Do not allow interrupt to fire before system resume
rtc: imxdi: fix trivial typos
rtc: cpcap: fix range
====================
net: stmmac: Fix multicast filter on IPQ806x
This pair of patches is the result of discovering a failure to
correctly receive IPv6 multicast packets on such a device (in particular
DHCPv6 requests and RA solicitations). Putting the device into
promiscuous mode, or allmulti, both resulted in such packets correctly
being received. Examination of the vendor driver (nss-gmac from the
qsdk) shows that it does not enable the multicast filter and instead
falls back to allmulti.
Extend the base dwmac1000 driver to fall back when there's no suitable
hardware filter, and update the ipq806x platform to request this.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
The IPQ806x does not appear to have a functional multicast ethernet
address filter. This was observed as a failure to correctly receive IPv6
packets on a LAN to the all stations address. Checking the vendor driver
shows that it does not attempt to enable the multicast filter and
instead falls back to receiving all multicast packets, internally
setting ALLMULTI.
Use the new fallback support in the dwmac1000 driver to correctly
achieve the same with the mainline IPQ806x driver. Confirmed to fix IPv6
functionality on an RB3011 router.
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan McDowell <noodles@earth.li>
Signed-off-by: David S. Miller <davem@davemloft.net>
net: stmmac: dwmac1000: provide multicast filter fallback
If no hardware multicast filter is available, then instead of silently
failing to listen for the requested Ethernet multicast addresses, fall
back to receiving all multicast packets, in a similar fashion to other
drivers with no multicast filter.
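The fallback decision can be sketched roughly as follows. This is an illustrative model only: the DEMO_FILTER_* bits, demo_select_filter() and its parameters are hypothetical and do not correspond to the real dwmac1000 register layout or function names.

```c
#include <stdbool.h>

/* Hypothetical filter modes, not the real GMAC frame filter bits. */
#define DEMO_FILTER_PROMISC	(1u << 0)
#define DEMO_FILTER_ALLMULTI	(1u << 1)
#define DEMO_FILTER_HASH	(1u << 2)

/*
 * Pick a receive filter mode. hash_bins == 0 models hardware that has
 * no usable multicast hash filter.
 */
unsigned int demo_select_filter(bool promisc, bool allmulti,
				unsigned int mc_count, unsigned int hash_bins)
{
	if (promisc)
		return DEMO_FILTER_PROMISC;
	if (allmulti)
		return DEMO_FILTER_ALLMULTI;
	if (mc_count == 0)
		return 0;			/* nothing requested */
	if (hash_bins == 0)
		return DEMO_FILTER_ALLMULTI;	/* no filter: fall back */
	return DEMO_FILTER_HASH;
}
```

The key branch is the hash_bins == 0 case: rather than programming a filter that cannot work, the sketch accepts all multicast traffic so that requested groups are still received.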
Cc: stable@vger.kernel.org
Signed-off-by: Jonathan McDowell <noodles@earth.li>
Signed-off-by: David S. Miller <davem@davemloft.net>
Dhananjay Phadke [Tue, 11 Aug 2020 00:42:40 +0000 (17:42 -0700)]
i2c: iproc: fix race between client unreg and isr
When an i2c client unregisters, synchronize the irq before setting
iproc_i2c->slave to NULL:
(1) disable_irq()
(2) Mask event enable bits in control reg
(3) Erase slave address (avoid further writes to rx fifo)
(4) Flush tx and rx FIFOs
(5) Clear pending event (interrupt) bits in status reg
(6) enable_irq()
(7) Set client pointer to NULL
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000318
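The ordering of steps (1) to (7) in this commit can be modelled with a small userspace sketch. The struct demo_i2c_dev fields and demo_unreg_slave() are hypothetical; in the driver these steps are register writes plus disable_irq()/enable_irq(), which are only imitated by plain fields here.

```c
#include <stddef.h>

/* Hypothetical model of the controller state touched during unregister. */
struct demo_i2c_dev {
	int irq_disabled;		/* models disable_irq()/enable_irq() */
	unsigned int event_enable;	/* event enable bits in the control reg */
	unsigned int slave_addr;	/* programmed slave address */
	unsigned int fifo_level;	/* bytes pending in the tx/rx FIFOs */
	unsigned int pending_events;	/* event bits in the status reg */
	void *slave;			/* client pointer used by the ISR */
};

void demo_unreg_slave(struct demo_i2c_dev *dev)
{
	dev->irq_disabled = 1;		/* (1) keep the ISR from running */
	dev->event_enable = 0;		/* (2) mask event enable bits */
	dev->slave_addr = 0;		/* (3) stop further rx fifo writes */
	dev->fifo_level = 0;		/* (4) flush tx and rx FIFOs */
	dev->pending_events = 0;	/* (5) clear pending event bits */
	dev->irq_disabled = 0;		/* (6) re-enable the interrupt */
	dev->slave = NULL;		/* (7) only now drop the client pointer */
}
```

The point of the ordering is that the client pointer is cleared last, after every source of further interrupts has been quiesced, so the handler should not run while the pointer is NULL.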
Johannes Berg [Wed, 12 Aug 2020 19:08:53 +0000 (21:08 +0200)]
ipv4: tunnel: fix compilation on ARCH=um
With certain configurations, a 64-bit ARCH=um errors
out here with an unknown csum_ipv6_magic() function.
Include the right header file to always have it.
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vsock: fix potential null pointer dereference in vsock_poll()
syzbot reported this issue where in the vsock_poll() we find the
socket state at TCP_ESTABLISHED, but 'transport' is null:
general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097]
CPU: 0 PID: 8227 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:vsock_poll+0x75a/0x8e0 net/vmw_vsock/af_vsock.c:1038
Call Trace:
sock_poll+0x159/0x460 net/socket.c:1266
vfs_poll include/linux/poll.h:90 [inline]
do_pollfd fs/select.c:869 [inline]
do_poll fs/select.c:917 [inline]
do_sys_poll+0x607/0xd40 fs/select.c:1011
__do_sys_poll fs/select.c:1069 [inline]
__se_sys_poll fs/select.c:1057 [inline]
__x64_sys_poll+0x18c/0x440 fs/select.c:1057
do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:384
entry_SYSCALL_64_after_hwframe+0x44/0xa9
This issue can happen if the TCP_ESTABLISHED state is set after
vsock_poll() has already read vsk->transport.
We could put barriers to synchronize, but this can only happen during
connection setup, so we can simply check that 'transport' is valid.
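A minimal sketch of the "check that 'transport' is valid" approach, in plain userspace C. The demo_* types, DEMO_ESTABLISHED, DEMO_POLLIN and the has_data callback are hypothetical simplifications, not the real vsock structures or transport operations.

```c
#include <stdbool.h>
#include <stddef.h>

#define DEMO_ESTABLISHED	1
#define DEMO_POLLIN		0x1u

/* Hypothetical transport with a single callback. */
struct demo_transport {
	bool (*has_data)(void);
};

struct demo_sock {
	int state;
	const struct demo_transport *transport;	/* may still be NULL during setup */
};

/* Example callback for a transport that always has data queued. */
bool demo_always_ready(void)
{
	return true;
}

unsigned int demo_poll(const struct demo_sock *sk)
{
	/* Read the pointer once, then test it before any dereference. */
	const struct demo_transport *transport = sk->transport;
	unsigned int mask = 0;

	if (transport && sk->state == DEMO_ESTABLISHED && transport->has_data())
		mask |= DEMO_POLLIN;

	return mask;
}
```

If the state is already ESTABLISHED but the transport is not assigned yet, the sketch simply reports no events for that call instead of dereferencing NULL; a later poll sees the assigned transport.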
Fixes: 377128ab2681 ("vsock: add multi-transports support")
Reported-and-tested-by: syzbot+a61bac2fcc1a7c6623fe@syzkaller.appspotmail.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Edward Cree [Wed, 12 Aug 2020 09:32:49 +0000 (10:32 +0100)]
sfc: fix ef100 design-param checking
The handling of the RXQ/TXQ size granularity design-params had two
problems: it had a 64-bit divide that didn't build on 32-bit platforms,
and it could divide by zero if the NIC supplied 0 as the value of the
design-param. Fix both by checking for 0 and for a granularity bigger
than our min-size; if the granularity <= EFX_MIN_DMAQ_SIZE then it fits
in 32 bits, so we can cast it to u32 for the divide.
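The check described above can be sketched as follows. DEMO_MIN_DMAQ_SIZE and its value, as well as demo_granularity_ok(), are assumptions for illustration; the final divisibility test is likewise an assumed use of "the divide", not a quote of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed minimum queue size; the driver's EFX_MIN_DMAQ_SIZE may differ. */
#define DEMO_MIN_DMAQ_SIZE	512u

/* Validate a 64-bit granularity value supplied by the NIC. */
bool demo_granularity_ok(uint64_t granularity)
{
	/* Reject 0 first so the modulo below cannot divide by zero. */
	if (granularity == 0)
		return false;

	/* Anything above the minimum size cannot divide it evenly. */
	if (granularity > DEMO_MIN_DMAQ_SIZE)
		return false;

	/*
	 * granularity <= DEMO_MIN_DMAQ_SIZE here, so it fits in 32 bits and
	 * the cast avoids a 64-bit division on 32-bit platforms.
	 */
	return (DEMO_MIN_DMAQ_SIZE % (uint32_t)granularity) == 0;
}
```

Ordering the range checks before the cast is what makes the narrowing safe: by the time the value is truncated to 32 bits, it is known to be nonzero and small.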
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Edward Cree <ecree@solarflare.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Linus Torvalds [Wed, 12 Aug 2020 19:51:31 +0000 (12:51 -0700)]
Merge tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client
Pull ceph updates from Ilya Dryomov:
"Xiubo has completed his work on filesystem client metrics; they are
sent to all available MDSes once per second now.
Other than that, we have a lot of fixes and cleanups all around the
filesystem, including a tweak to cut down on MDS request resends in
multi-MDS setups from Yanhu and fixups for SELinux symlink labeling
and MClientSession message decoding from Jeff"
* tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client: (22 commits)
ceph: handle zero-length feature mask in session messages
ceph: use frag's MDS in either mode
ceph: move sb->wb_pagevec_pool to be a global mempool
ceph: set sec_context xattr on symlink creation
ceph: remove redundant initialization of variable mds
ceph: fix use-after-free for fsc->mdsc
ceph: remove unused variables in ceph_mdsmap_decode()
ceph: delete repeated words in fs/ceph/
ceph: send client provided metric flags in client metadata
ceph: periodically send perf metrics to MDSes
ceph: check the sesion state and return false in case it is closed
libceph: replace HTTP links with HTTPS ones
ceph: remove unnecessary cast in kfree()
libceph: just have osd_req_op_init() return a pointer
ceph: do not access the kiocb after aio requests
ceph: clean up and optimize ceph_check_delayed_caps()
ceph: fix potential mdsc use-after-free crash
ceph: switch to WARN_ON_ONCE in encode_supported_features()
ceph: add global total_caps to count the mdsc's total caps number
ceph: add check_session_state() helper and make it global
...
Linus Torvalds [Wed, 12 Aug 2020 19:41:15 +0000 (12:41 -0700)]
Merge branch 'parisc-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux
Pull more parisc updates from Helge Deller:
- Oscar Carter contributed a patch which fixes parisc's usage of
dereference_function_descriptor() and thus will allow using the
-Wcast-function-type compiler option in the top-level Makefile
- Sven Schnelle fixed a bug in the SBA code to prevent crashes during
kexec
- John David Anglin provided implementations for the
__smp_store_release() and __smp_load_acquire() barriers which avoid
using the sync assembler instruction and thus speed up barrier paths
- Some whitespace cleanups in parisc's atomic.h header file
* 'parisc-5.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux:
parisc: Implement __smp_store_release and __smp_load_acquire barriers
parisc: mask out enable and reserved bits from sba imask
parisc: Whitespace cleanups in atomic.h
parisc/kernel/ftrace: Remove function callback casts
sections.h: dereference_function_descriptor() returns void pointer
Linus Torvalds [Wed, 12 Aug 2020 19:25:06 +0000 (12:25 -0700)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull more KVM updates from Paolo Bonzini:
"PPC:
- Improvements and bugfixes for secure VM support, giving reduced
startup time and memory hotplug support.
- Locking fixes in nested KVM code
- Increase number of guests supported by HV KVM to 4094
- Preliminary POWER10 support
ARM:
- Split the VHE and nVHE hypervisor code bases, build the EL2 code
separately, allowing for the VHE code to now be built with
instrumentation
- Level-based TLB invalidation support
- Restructure of the vcpu register storage to accommodate the NV code
- Pointer Authentication available for guests on nVHE hosts
- Simplification of the system register table parsing
- MMU cleanups and fixes
- A number of post-32bit cleanups and other fixes
MIPS:
- compilation fixes
x86:
- bugfixes
- support for the SERIALIZE instruction"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (70 commits)
KVM: MIPS/VZ: Fix build error caused by 'kvm_run' cleanup
x86/kvm/hyper-v: Synic default SCONTROL MSR needs to be enabled
MIPS: KVM: Convert a fallthrough comment to fallthrough
MIPS: VZ: Only include loongson_regs.h for CPU_LOONGSON64
x86: Expose SERIALIZE for supported cpuid
KVM: x86: Don't attempt to load PDPTRs when 64-bit mode is enabled
KVM: arm64: Move S1PTW S2 fault logic out of io_mem_abort()
KVM: arm64: Don't skip cache maintenance for read-only memslots
KVM: arm64: Handle data and instruction external aborts the same way
KVM: arm64: Rename kvm_vcpu_dabt_isextabt()
KVM: arm: Add trace name for ARM_NISV
KVM: arm64: Ensure that all nVHE hyp code is in .hyp.text
KVM: arm64: Substitute RANDOMIZE_BASE for HARDEN_EL2_VECTORS
KVM: arm64: Make nVHE ASLR conditional on RANDOMIZE_BASE
KVM: PPC: Book3S HV: Rework secure mem slot dropping
KVM: PPC: Book3S HV: Move kvmppc_svm_page_out up
KVM: PPC: Book3S HV: Migrate hot plugged memory
KVM: PPC: Book3S HV: In H_SVM_INIT_DONE, migrate remaining normal-GFNs to secure-GFNs
KVM: PPC: Book3S HV: Track the state GFNs associated with secure VMs
KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
...
Linus Torvalds [Wed, 12 Aug 2020 19:19:49 +0000 (12:19 -0700)]
Merge tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull more clk updates from Stephen Boyd:
"Here's some more updates that missed the last pull request because I
happened to tag the tree at an earlier point in the history of
clk-next. I must have fat fingered it and checked out an older version
of clk-next on this second computer I'm using.
This time it actually includes more code for Qualcomm SoCs, the AT91
major updates, and some Rockchip SoC clk driver updates as well. I've
corrected this flow so this shouldn't happen again"
* tag 'clk-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux: (83 commits)
clk: bcm2835: Do not use prediv with bcm2711's PLLs
clk: drop unused function __clk_get_flags
clk: hsdk: Fix bad dependency on IOMEM
dt-bindings: clock: Fix YAML schemas for LPASS clocks on SC7180
clk: mmp: avoid missing prototype warning
clk: sparx5: Add Sparx5 SoC DPLL clock driver
dt-bindings: clock: sparx5: Add bindings include file
clk: qoriq: add LS1021A core pll mux options
clk: clk-atlas6: fix return value check in atlas6_clk_init()
clk: tegra: pll: Improve PLLM enable-state detection
clk: X1000: Add support for calculat REFCLK of USB PHY.
clk: JZ4780: Reformat the code to align it.
clk: JZ4780: Add functions for enable and disable USB PHY.
clk: Ingenic: Add RTC related clocks for Ingenic SoCs.
dt-bindings: clock: Add tabs to align code.
dt-bindings: clock: Add RTC related clocks for Ingenic SoCs.
clk: davinci: Use fallthrough pseudo-keyword
clk: imx: Use fallthrough pseudo-keyword
clk: qcom: gcc-sdm660: Fix up gcc_mss_mnoc_bimc_axi_clk
clk: qcom: gcc-sdm660: Add missing modem reset
...
Linus Torvalds [Wed, 12 Aug 2020 19:13:44 +0000 (12:13 -0700)]
Merge tag 'linux-watchdog-5.9-rc1' of git://www.linux-watchdog.org/linux-watchdog
Pull watchdog updates from Wim Van Sebroeck:
- f71808e_wdt improvements
- dw_wdt improvements
- mlx-wdt: support new watchdog type with longer timeout period
- fallthrough pseudo-keyword replacements
- overall small fixes and improvements
* tag 'linux-watchdog-5.9-rc1' of git://www.linux-watchdog.org/linux-watchdog: (35 commits)
watchdog: rti-wdt: balance pm runtime enable calls
watchdog: rti-wdt: attach to running watchdog during probe
watchdog: add support for adjusting last known HW keepalive time
watchdog: use __watchdog_ping in startup
watchdog: softdog: Add options 'soft_reboot_cmd' and 'soft_active_on_boot'
watchdog: pcwd_usb: remove needless check before usb_free_coherent()
watchdog: Replace HTTP links with HTTPS ones
dt-bindings: watchdog: renesas,wdt: Document r8a774e1 support
watchdog: initialize device before misc_register
watchdog: booke_wdt: Add common nowayout parameter driver
watchdog: scx200_wdt: Use fallthrough pseudo-keyword
watchdog: Use fallthrough pseudo-keyword
watchdog: f71808e_wdt: do stricter parameter validation
watchdog: f71808e_wdt: clear watchdog timeout occurred flag
watchdog: f71808e_wdt: remove use of wrong watchdog_info option
watchdog: f71808e_wdt: indicate WDIOF_CARDRESET support in watchdog_info.options
docs: watchdog: codify ident.options as superset of possible status flags
dt-bindings: watchdog: Add compatible for QCS404, SC7180, SDM845, SM8150
dt-bindings: watchdog: Convert QCOM watchdog timer bindings to YAML
watchdog: dw_wdt: Add DebugFS files
...
Linus Torvalds [Wed, 12 Aug 2020 18:53:01 +0000 (11:53 -0700)]
Merge tag 'drm-next-2020-08-12' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"This has a few vmwgfx regression fixes we hit from the merge window
(one in TTM), it also has a bunch of amdgpu fixes along with a
scattering everywhere else.
core:
- Fix drm_dp_mst_port refcount leaks in drm_dp_mst_allocate_vcpi
- Remove null check for kfree in drm_dev_release.
- Fix DRM_FORMAT_MOD_AMLOGIC_FBC definition.
- re-added docs for drm_gem_flink_ioctl()
- add orientation quirk for ASUS T103HAF
fbcon:
- Fix a fbcon OOB read in fbdev, found by syzbot.
vga:
- Mark vga_tryget static as it's not used elsewhere.
amdgpu:
- Re-add spelling typo fix
- Sienna Cichlid fixes
- Navy Flounder fixes
- DC fixes
- SMU i2c fix
- Power fixes
vmwgfx:
- regression fixes for modesetting crashes
- misc fixes
xlnx:
- Small fixes to xlnx.
omap:
- Fix mode initialization in omap_connector_mode_valid().
- force runtime PM suspend on system suspend
tidss:
- fix modeset init for DPI panels"
* tag 'drm-next-2020-08-12' of git://anongit.freedesktop.org/drm/drm: (70 commits)
drm/ttm: revert "drm/ttm: make TT creation purely optional v3"
drm/vmwgfx: fix spelling mistake "Cant" -> "Can't"
drm/vmwgfx: fix spelling mistake "Cound" -> "Could"
drm/vmwgfx/ldu: Use drm_mode_config_reset
drm/vmwgfx/sou: Use drm_mode_config_reset
drm/vmwgfx/stdu: Use drm_mode_config_reset
drm/vmwgfx: Fix two list_for_each loop exit tests
drm/vmwgfx: Use correct vmw_legacy_display_unit pointer
drm/vmwgfx: Use struct_size() helper
drm/amdgpu: Fix bug where DPM is not enabled after hibernate and resume
drm/amd/powerplay: put VCN/JPEG into PG ungate state before dpm table setup(V3)
drm/amd/powerplay: update swSMU VCN/JPEG PG logics
drm/amdgpu: use mode1 reset by default for sienna_cichlid
drm/amdgpu/smu: rework i2c adpater registration
drm/amd/display: Display goes blank after inst
drm/amd/display: Change null plane state swizzle mode to 4kb_s
drm/amd/display: Use helper function to check for HDMI signal
drm/amd/display: AMD OUI (DPCD 0x00300) skipped on some sink
drm/amd/display: Fix logger context
drm/amd/display: populate new dml variable
...
Linus Torvalds [Wed, 12 Aug 2020 18:24:12 +0000 (11:24 -0700)]
Merge branch 'akpm' (patches from Andrew)
Merge more updates from Andrew Morton:
- most of the rest of MM (memcg, hugetlb, vmscan, proc, compaction,
mempolicy, oom-kill, hugetlbfs, migration, thp, cma, util,
memory-hotplug, cleanups, uaccess, migration, gup, pagemap),
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (164 commits)
mm/gup: remove task_struct pointer for all gup code
mm: clean up the last pieces of page fault accountings
mm/xtensa: use general page fault accounting
mm/x86: use general page fault accounting
mm/sparc64: use general page fault accounting
mm/sparc32: use general page fault accounting
mm/sh: use general page fault accounting
mm/s390: use general page fault accounting
mm/riscv: use general page fault accounting
mm/powerpc: use general page fault accounting
mm/parisc: use general page fault accounting
mm/openrisc: use general page fault accounting
mm/nios2: use general page fault accounting
mm/nds32: use general page fault accounting
mm/mips: use general page fault accounting
mm/microblaze: use general page fault accounting
mm/m68k: use general page fault accounting
mm/ia64: use general page fault accounting
mm/hexagon: use general page fault accounting
mm/csky: use general page fault accounting
...