Linus Torvalds [Sun, 13 Dec 2020 19:31:19 +0000 (11:31 -0800)]
Merge tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Thomas Gleixner:
"A set of x86 and membarrier fixes:
- Correct a few problems in the x86 and the generic membarrier
implementation. Small corrections for assumptions about visibility
which have turned out not to be true.
- Make the PAT bits for memory encryption correct vs 4K and 2M/1G
page table entries as they are at a different location.
- Fix a concurrency issue in the local bandwidth readout of
resource control leading to incorrect values
- Fix the ordering of allocating a vector for an interrupt. The order
failed to respect the provided cpumask when the first attempt of
allocating node local in the mask fails. It then tries the node
instead of trying the full provided mask first. This leads to
erroneous error messages and breaking the (user) supplied affinity
request. Reorder it.
- Make the INT3 padding detection in optprobe work correctly"
* tag 'x86-urgent-2020-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/kprobes: Fix optprobe to detect INT3 padding correctly
x86/apic/vector: Fix ordering in vector assignment
x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
membarrier: Execute SYNC_CORE on the calling thread
membarrier: Explicitly sync remote cores when SYNC_CORE is requested
membarrier: Add an actual barrier before rseq_preempt()
x86/membarrier: Get rid of a dubious optimization
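The vector-assignment ordering described in the pull message can be sketched as follows. This is only an illustration under stated assumptions: CPU sets are modelled as 64-bit bitmasks rather than the kernel's struct cpumask, and pick_cpus(), its parameters and the demo values are hypothetical names, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: each bit is one CPU. `free_cpus` marks CPUs that still
 * have a vector available. Returns the usable CPUs from the first candidate
 * set that has any, or 0 if every attempt fails.
 */
uint64_t pick_cpus(uint64_t affinity, uint64_t node, uint64_t free_cpus)
{
    /* Order after the fix: node-local CPUs inside the mask, then the full
     * provided mask, and only then the node itself. */
    const uint64_t candidates[] = { affinity & node, affinity, node };

    for (unsigned int i = 0; i < 3; i++) {
        uint64_t usable = candidates[i] & free_cpus;

        if (usable)
            return usable;
    }
    return 0;
}

void pick_cpus_demo(void)
{
    /* Affinity: CPUs 0-2; node: CPUs 2-3; vectors free on CPUs 1 and 3. */
    uint64_t got = pick_cpus(0x7, 0xc, 0xa);

    /* CPU 1 is chosen, so the supplied affinity mask is honoured. */
    assert(got == 0x2);
}
```

With the old order (node before the full mask) the same inputs would select CPU 3, which lies outside the requested affinity.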
Linus Torvalds [Sun, 13 Dec 2020 18:36:23 +0000 (10:36 -0800)]
Merge tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
"This should be it for 5.10.
Mike and Song looked into the warning case, and thankfully it appears
the fix was pretty trivial - we can just change the md device chunk
type to unsigned int to get rid of it. They cannot currently be < 0,
and nobody is checking for that either.
We're reverting the discard changes as the corruption reports came in
very late, and there's just no time to attempt to deal with it at this
point. Reverting the changes in question is the right call for 5.10"
* tag 'block-5.10-2020-12-12' of git://git.kernel.dk/linux-block:
md: change mddev 'chunk_sectors' from int to unsigned
Revert "md: add md_submit_discard_bio() for submitting discard bio"
Revert "md/raid10: extend r10bio devs to raid disks"
Revert "md/raid10: pull codes that wait for blocked dev into one function"
Revert "md/raid10: improve raid10 discard request"
Revert "md/raid10: improve discard request for far layout"
Revert "dm raid: remove unnecessary discard limits for raid10"
Linus Torvalds [Sat, 12 Dec 2020 20:57:12 +0000 (12:57 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Five small fixes. Four in drivers:
- hisi_sas: fix internal queue timeout
- be2iscsi: revert a prior fix causing problems
- bnx2i: add missing dependency
- storvsc: late arriving revert of a problem fix
and one in the core.
The core one is a minor change to stop paying attention to the busy
count when returning out of resources because there's a race window
where the queue might not restart due to missing returning I/O"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
Revert "scsi: storvsc: Validate length of incoming packet in storvsc_on_channel_callback()"
scsi: hisi_sas: Select a suitable queue for internal I/Os
scsi: core: Fix race between handling STS_RESOURCE and completion
scsi: be2iscsi: Revert "Fix a theoretical leak in beiscsi_create_eqs()"
scsi: bnx2i: Requires MMU
Linus Torvalds [Sat, 12 Dec 2020 18:08:16 +0000 (10:08 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull kvm fixes from Paolo Bonzini:
"Bugfixes for ARM, x86 and tools"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
tools/kvm_stat: Exempt time-based counters
KVM: mmu: Fix SPTE encoding of MMIO generation upper half
kvm: x86/mmu: Use cpuid to determine max gfn
kvm: svm: de-allocate svm_cpu_data for all cpus in svm_cpu_uninit()
selftests: kvm/set_memory_region_test: Fix race in move region test
KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
KVM: arm64: Fix handling of merging tables into a block entry
KVM: arm64: Fix memory leak on stage2 update of a valid PTE
Linus Torvalds [Sat, 12 Dec 2020 18:02:03 +0000 (10:02 -0800)]
Merge tag 'for-linus-5.10c-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"A short series fixing a regression introduced in 5.9 for running as
Xen dom0 on a system with NVMe backed storage"
* tag 'for-linus-5.10c-rc8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen: don't use page->lru for ZONE_DEVICE memory
xen: add helpers for caching grant mapping pages
Linus Torvalds [Sat, 12 Dec 2020 17:50:26 +0000 (09:50 -0800)]
Merge tag 'riscv-for-linus-5.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fix from Palmer Dabbelt:
"Just one fix. It's nothing critical, just a randconfig that wasn't
building. That said, it does seem pretty safe and is technically a
regression so I'm sending it along for 5.10:
- define get_cycles64() all the time, as it's used by most
configurations"
* tag 'riscv-for-linus-5.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
RISC-V: Define get_cycles64() regardless of M-mode
Linus Torvalds [Sat, 12 Dec 2020 17:41:33 +0000 (09:41 -0800)]
Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input
Pull input fixes from Dmitry Torokhov:
- a fix for cm109 stomping on its own control URB if it tries to toggle
the buzzer immediately after userspace opens the input device (found by
syzkaller)
- another fix for Raydium touchscreens that do not like splitting
command transfers
- quirks for i8042, soc_button_array, and goodix drivers to make them
work better with certain hardware.
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
Input: goodix - add upside-down quirk for Teclast X98 Pro tablet
Input: cm109 - do not stomp on control URB
Input: i8042 - add Acer laptops to the i8042 reset list
Input: cros_ec_keyb - send 'scancodes' in addition to key events
Input: soc_button_array - add Lenovo Yoga Tablet2 1051L to the dmi_use_low_level_irq list
Input: raydium_ts_i2c - do not split tx transactions
Mike Snitzer [Sat, 12 Dec 2020 16:55:37 +0000 (11:55 -0500)]
md: change mddev 'chunk_sectors' from int to unsigned
Commit e2782f560c29 ("Revert "dm raid: remove unnecessary discard
limits for raid10"") exposed compiler warnings introduced by commit e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10"):
In file included from ./include/linux/kernel.h:14,
from ./include/asm-generic/bug.h:20,
from ./arch/x86/include/asm/bug.h:93,
from ./include/linux/bug.h:5,
from ./include/linux/mmdebug.h:5,
from ./include/linux/gfp.h:5,
from ./include/linux/slab.h:15,
from drivers/md/dm-raid.c:8:
drivers/md/dm-raid.c: In function ‘raid_io_hints’:
./include/linux/minmax.h:18:28: warning: comparison of distinct pointer types lacks a cast
(!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
^~
./include/linux/minmax.h:32:4: note: in expansion of macro ‘__typecheck’
(__typecheck(x, y) && __no_side_effects(x, y))
^~~~~~~~~~~
./include/linux/minmax.h:42:24: note: in expansion of macro ‘__safe_cmp’
__builtin_choose_expr(__safe_cmp(x, y), \
^~~~~~~~~~
./include/linux/minmax.h:51:19: note: in expansion of macro ‘__careful_cmp’
#define min(x, y) __careful_cmp(x, y, <)
^~~~~~~~~~~~~
./include/linux/minmax.h:84:39: note: in expansion of macro ‘min’
__x == 0 ? __y : ((__y == 0) ? __x : min(__x, __y)); })
^~~
drivers/md/dm-raid.c:3739:33: note: in expansion of macro ‘min_not_zero’
limits->max_discard_sectors = min_not_zero(rs->md.chunk_sectors,
^~~~~~~~~~~~
Fix this by changing the chunk_sectors member of 'struct mddev' from
int to 'unsigned int' to match the type used for the 'chunk_sectors'
member of 'struct queue_limits'. Various MD code still uses 'int' but
none of it appears to ever make use of signed int; and storing
positive signed int in unsigned is perfectly safe.
Reported-by: Song Liu <songliubraving@fb.com>
Fixes: e2782f560c29 ("Revert "dm raid: remove unnecessary discard limits for raid10"")
Fixes: e0910c8e4f87 ("dm raid: fix discard limits for raid1 and raid10")
Cc: stable@vger.kernel.org # e0910c8e4f87 was marked for stable@
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Reviewed-by: Song Liu <song@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
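A minimal sketch of why the member type matters to the type-checked min() is shown below. It assumes GCC or Clang (it uses __builtin_types_compatible_p and __typeof__); the struct names, SAME_TYPE() and min_not_zero_uint() are hypothetical stand-ins, and only the member types are taken from the commit message.

```c
#include <assert.h>

/* 1 when both expressions have the same type (GCC/Clang builtin). */
#define SAME_TYPE(x, y) \
    __builtin_types_compatible_p(__typeof__(x), __typeof__(y))

/* Hypothetical stand-ins; only the member types matter here. */
struct mddev_before { int chunk_sectors; };
struct mddev_after { unsigned int chunk_sectors; };
struct limits_sketch { unsigned int chunk_sectors; };

/* int vs unsigned int: the pair that made the type check warn. */
int types_match_before(void)
{
    return SAME_TYPE(((struct mddev_before *)0)->chunk_sectors,
                     ((struct limits_sketch *)0)->chunk_sectors);
}

/* unsigned int vs unsigned int: the pair after the change. */
int types_match_after(void)
{
    return SAME_TYPE(((struct mddev_after *)0)->chunk_sectors,
                     ((struct limits_sketch *)0)->chunk_sectors);
}

/* Plain-function equivalent of a min_not_zero() on unsigned int values. */
unsigned int min_not_zero_uint(unsigned int x, unsigned int y)
{
    if (x == 0)
        return y;
    if (y == 0)
        return x;
    return x < y ? x : y;
}

void chunk_sectors_demo(void)
{
    assert(!types_match_before());
    assert(types_match_after());
    assert(min_not_zero_uint(0, 128) == 128);
}
```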
Masami Hiramatsu [Fri, 11 Dec 2020 07:04:17 +0000 (16:04 +0900)]
x86/kprobes: Fix optprobe to detect INT3 padding correctly
Commit
7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
changed the padding bytes between functions from NOP to INT3. However,
when optprobe decodes a target function it finds INT3 and gives up the
jump optimization.
Instead of giving up whenever INT3 is detected, check whether the rest of the
bytes to the end of the function are INT3. If all of them are INT3,
those come from the linker. In that case, continue the optprobe jump
optimization.
[ bp: Massage commit message. ]
Fixes: 7705dc855797 ("x86/vmlinux: Use INT3 instead of NOP for linker fill bytes")
Reported-by: Adam Zabrocki <pi3@pi3.com.pl>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/160767025681.3880685.16021570341428835411.stgit@devnote2
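The padding check described in this commit can be sketched in a few lines. This is an illustrative helper, not the kernel's implementation: rest_is_int3_padding() and its parameters are hypothetical, and the function body is passed in as a plain byte buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define INT3_OPCODE 0xcc /* x86 single-byte breakpoint instruction */

/*
 * True when every byte from `offset` to the end of the function is INT3,
 * i.e. the remainder looks like linker fill rather than real code.
 * An empty remainder (offset == len) also counts as padding.
 */
bool rest_is_int3_padding(const unsigned char *func, size_t len, size_t offset)
{
    for (size_t i = offset; i < len; i++) {
        if (func[i] != INT3_OPCODE)
            return false;
    }
    return true;
}

void int3_padding_demo(void)
{
    /* xor %eax,%eax; ret; followed by three fill bytes. */
    const unsigned char padded[] = { 0x31, 0xc0, 0xc3, 0xcc, 0xcc, 0xcc };
    /* An INT3 followed by more code is not padding. */
    const unsigned char embedded[] = { 0x31, 0xc0, 0xcc, 0x90, 0xc3, 0xcc };

    assert(rest_is_int3_padding(padded, sizeof(padded), 3));
    assert(!rest_is_int3_padding(embedded, sizeof(embedded), 2));
}
```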
Stefan Raspl [Tue, 8 Dec 2020 21:08:29 +0000 (22:08 +0100)]
tools/kvm_stat: Exempt time-based counters
The new counters halt_poll_success_ns and halt_poll_fail_ns do not count
events. Instead they provide a time, and mess up our statistics. Therefore,
we should exclude them.
Removal is currently implemented with an exempt list. If more counters like
these appear, we can think about a more general rule like excluding all
fields named "*_ns", in case that's a standing convention.
Signed-off-by: Stefan Raspl <raspl@linux.ibm.com>
Tested-and-reported-by: Christian Borntraeger <borntraeger@de.ibm.com>
Message-Id: <20201208210829.101324-1-raspl@linux.ibm.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
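The two filtering approaches mentioned above (an explicit exempt list, and a possible future "*_ns" rule) can be compared with a small sketch. kvm_stat itself is a Python tool; this C version only illustrates the logic, and is_exempt() and has_ns_suffix() are hypothetical names. The counter "exits" is used purely as an example of a non-exempt field.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Explicit exempt list: the two time-based counters named in the commit. */
bool is_exempt(const char *name)
{
    static const char *const exempt[] = {
        "halt_poll_success_ns",
        "halt_poll_fail_ns",
    };

    for (size_t i = 0; i < sizeof(exempt) / sizeof(exempt[0]); i++) {
        if (strcmp(name, exempt[i]) == 0)
            return true;
    }
    return false;
}

/* More general rule: exclude every field whose name ends in "_ns". */
bool has_ns_suffix(const char *name)
{
    static const char suffix[] = "_ns";
    size_t n = strlen(name);
    size_t s = sizeof(suffix) - 1;

    return n >= s && strcmp(name + n - s, suffix) == 0;
}

void exempt_demo(void)
{
    assert(is_exempt("halt_poll_fail_ns"));
    assert(has_ns_suffix("halt_poll_fail_ns"));
    assert(!is_exempt("exits") && !has_ns_suffix("exits"));
}
```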
KVM: mmu: Fix SPTE encoding of MMIO generation upper half
Commit cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
cleaned up the computation of MMIO generation SPTE masks, but it
introduced a bug in how the upper part was encoded:
SPTE bits 52-61 were supposed to contain bits 10-19 of the current
generation number, however a missing shift encoded bits 1-10 there instead
(mostly duplicating the lower part of the encoded generation number that
then consisted of bits 1-9).
In the meantime, the upper part was shrunk by one bit and moved by
subsequent commits to become an upper half of the encoded generation number
(bits 9-17 of bits 0-17 encoded in a SPTE).
In addition to the above, commit 56871d444bc4 ("KVM: x86: fix overlap between SPTE_MMIO_MASK and generation")
has changed the SPTE bit range assigned to encode the generation number and
the total number of bits encoded but did not update them in the comment
attached to their defines, nor in the KVM MMU doc.
Let's do it here, too, since it is too trivial a thing to warrant a separate
commit.
Fixes: cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
Signed-off-by: Maciej S. Szmigiero <maciej.szmigiero@oracle.com>
Message-Id: <156700708db2a5296c5ed7a8b9ac71f1e9765c85.1607129096.git.maciej.szmigiero@oracle.com>
Cc: stable@vger.kernel.org
[Reorganize macros so that everything is computed from the bit ranges. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
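The effect of a missing shift when splitting a generation number into a lower and an upper half can be shown with a simplified sketch. The 9+9 bit split and the start of the upper field at SPTE bit 52 follow the description above; the position of the lower field (bit 3) and all macro and function names are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define GEN_LOW_BITS    9
#define GEN_HIGH_BITS   9
#define SPTE_LOW_SHIFT  3  /* assumption, for illustration only */
#define SPTE_HIGH_SHIFT 52

#define GEN_LOW_MASK  ((UINT64_C(1) << GEN_LOW_BITS) - 1)
#define GEN_HIGH_MASK ((UINT64_C(1) << GEN_HIGH_BITS) - 1)

/* Correct: the upper field receives generation bits 9-17. */
uint64_t encode_gen_fixed(uint64_t gen)
{
    uint64_t low = gen & GEN_LOW_MASK;
    uint64_t high = (gen >> GEN_LOW_BITS) & GEN_HIGH_MASK;

    return (low << SPTE_LOW_SHIFT) | (high << SPTE_HIGH_SHIFT);
}

/* Buggy: without the right shift the upper field repeats the low bits. */
uint64_t encode_gen_buggy(uint64_t gen)
{
    uint64_t low = gen & GEN_LOW_MASK;
    uint64_t high = gen & GEN_HIGH_MASK; /* missing ">> GEN_LOW_BITS" */

    return (low << SPTE_LOW_SHIFT) | (high << SPTE_HIGH_SHIFT);
}

/* Reassemble the generation number from both fields. */
uint64_t decode_gen(uint64_t spte)
{
    uint64_t low = (spte >> SPTE_LOW_SHIFT) & GEN_LOW_MASK;
    uint64_t high = (spte >> SPTE_HIGH_SHIFT) & GEN_HIGH_MASK;

    return low | (high << GEN_LOW_BITS);
}

void gen_encoding_demo(void)
{
    assert(decode_gen(encode_gen_fixed(0x2abcd)) == 0x2abcd);
    /* Generation bit 9 is lost entirely by the buggy encoding. */
    assert(decode_gen(encode_gen_buggy(0x200)) == 0);
}
```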
Linus Torvalds [Fri, 11 Dec 2020 22:29:46 +0000 (14:29 -0800)]
Merge tag 'mtd/fixes-for-5.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal:
"Second series of fixes for raw NAND drivers initiated because of a
rework of the ECC engine subsystem.
The location of the DT parsing logic got moved, breaking several
drivers which in fact were not doing the ECC engine initialization at
the right place.
These drivers have been fixed by enforcing a particular ECC engine
type and algorithm, software Hamming, while the algorithm may be
overwritten by a DT property. This merge request fixes this in the
xway, socrates, plat_nand, pasemi, orion, mpc5121, gpio, au1550 and
ams-delta controller drivers"
* tag 'mtd/fixes-for-5.10-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: xway: Do not force a particular software ECC engine
mtd: rawnand: socrates: Do not force a particular software ECC engine
mtd: rawnand: plat_nand: Do not force a particular software ECC engine
mtd: rawnand: pasemi: Do not force a particular software ECC engine
mtd: rawnand: orion: Do not force a particular software ECC engine
mtd: rawnand: mpc5121: Do not force a particular software ECC engine
mtd: rawnand: gpio: Do not force a particular software ECC engine
mtd: rawnand: au1550: Do not force a particular software ECC engine
mtd: rawnand: ams-delta: Do not force a particular software ECC engine
Linus Torvalds [Fri, 11 Dec 2020 22:26:17 +0000 (14:26 -0800)]
Merge tag 'mmc-v5.10-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"A couple of MMC fixes:
MMC core:
- Fixup condition for CMD13 polling for RPMB requests
MMC host:
- mtk-sd: Fix system suspend/resume support for CQHCI
- mtk-sd: Extend SDIO IRQ fix to more variants
- sdhci-of-arasan: Fix clock registration error for Keem Bay SOC
- tmio: Bring HW to a sane state after a power off"
* tag 'mmc-v5.10-rc4-2' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: mediatek: mark PM functions as __maybe_unused
mmc: block: Fixup condition for CMD13 polling for RPMB requests
mmc: tmio: improve bringing HW to a sane state with MMC_POWER_OFF
mmc: sdhci-of-arasan: Fix clock registration error for Keem Bay SOC
mmc: mediatek: Extend recheck_sdio_irq fix to more variants
mmc: mediatek: Fix system suspend/resume support for CQHCI
Andrii Nakryiko [Fri, 11 Dec 2020 21:36:25 +0000 (22:36 +0100)]
bpf: Fix enum names for bpf_this_cpu_ptr() and bpf_per_cpu_ptr() helpers
Remove bpf_ prefix, which causes these helpers to be reported in verifier
dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(), respectively. Let's
fix it as long as it is still possible before UAPI freezes on these helpers.
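The doubled prefix is easy to reproduce with an X-macro style list whose name table adds "bpf_" to every entry. The sketch below is loosely modelled on that pattern; HELPERS_BEFORE, HELPERS_AFTER and AS_NAME are hypothetical macros, not the kernel's.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper lists, before and after dropping the prefix. */
#define HELPERS_BEFORE(FN) FN(bpf_this_cpu_ptr) FN(bpf_per_cpu_ptr)
#define HELPERS_AFTER(FN)  FN(this_cpu_ptr) FN(per_cpu_ptr)

/* The name table prepends "bpf_" to whatever the list entry is called. */
#define AS_NAME(x) "bpf_" #x,

const char *const names_before[] = { HELPERS_BEFORE(AS_NAME) };
const char *const names_after[] = { HELPERS_AFTER(AS_NAME) };

void helper_names_demo(void)
{
    assert(strcmp(names_before[0], "bpf_bpf_this_cpu_ptr") == 0);
    assert(strcmp(names_after[0], "bpf_this_cpu_ptr") == 0);
}
```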
Linus Torvalds [Fri, 11 Dec 2020 22:10:51 +0000 (14:10 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"8 patches.
Subsystems affected by this patch series: proc, selftests, kbuild, and
mm (pagecache, kasan, hugetlb)"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
mm/hugetlb: clear compound_nr before freeing gigantic pages
kasan: fix object remaining in offline per-cpu quarantine
elfcore: fix building with clang
initramfs: fix clang build failure
kbuild: avoid static_assert for genksyms
selftest/fpu: avoid clang warning
proc: use untagged_addr() for pagemap_read addresses
revert "mm/filemap: add static for function __add_to_page_cache_locked"
Gerald Schaefer [Fri, 11 Dec 2020 21:36:53 +0000 (13:36 -0800)]
mm/hugetlb: clear compound_nr before freeing gigantic pages
Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
added compound_nr counter to first tail struct page, overlaying with
page->mapping. The overlay itself is fine, but while freeing gigantic
hugepages via free_contig_range(), a "bad page" check will trigger for
non-NULL page->mapping on the first tail page:
BUG: Bad page state in process bash pfn:380001
page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
aops:0x0
flags: 0x3ffff00000000000()
raw: 3ffff00000000000000000000000010000000000000001220000000100000000
raw: 00000000000000000000000000000000ffffffff000000000000000000000000
page dumped because: non-NULL mapping
Modules linked in:
CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
Hardware name: IBM 3906 M03 703 (LPAR)
Call Trace:
show_stack+0x6e/0xe8
dump_stack+0x90/0xc8
bad_page+0xd6/0x130
free_pcppages_bulk+0x26a/0x800
free_unref_page+0x6e/0x90
free_contig_range+0x94/0xe8
update_and_free_page+0x1c4/0x2c8
free_pool_huge_page+0x11e/0x138
set_max_huge_pages+0x228/0x300
nr_hugepages_store_common+0xb8/0x130
kernfs_fop_write+0xd2/0x218
vfs_write+0xb0/0x2b8
ksys_write+0xac/0xe0
system_call+0xe6/0x288
Disabling lock debugging due to kernel taint
This is because only the compound_order is cleared in
destroy_compound_gigantic_page(), and compound_nr is set to
1U << order == 1 for order 0 in set_compound_order(page, 0).
Fix this by explicitly clearing compound_nr for first tail page after
calling set_compound_order(page, 0).
Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
Signed-off-by: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: <stable@vger.kernel.org> [5.9+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
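The overlay problem can be modelled with a heavily reduced structure. Everything in this sketch is hypothetical (struct tail_page_sketch, the helper names, the order value 10); it keeps only the property described above, namely that compound_nr shares storage with page->mapping. It relies on reading a union member other than the one last written, which GCC and Clang support, and on a null pointer being all-zero bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, heavily reduced model of the first tail page. */
struct tail_page_sketch {
    unsigned char compound_order;
    union {
        void *mapping;            /* checked when the page is freed */
        unsigned int compound_nr; /* overlays the same storage */
    };
};

/* Mirrors the described behaviour: the order is stored and nr = 1U << order. */
void set_compound_order_sketch(struct tail_page_sketch *p, unsigned int order)
{
    p->compound_order = (unsigned char)order;
    p->compound_nr = 1U << order;
}

/* The "bad page" condition from the report: non-NULL mapping. */
bool bad_page_check(const struct tail_page_sketch *p)
{
    return p->mapping != NULL;
}

/* Tear down a compound page and report whether the check would trigger. */
bool bad_after_destroy(bool apply_fix)
{
    struct tail_page_sketch page;

    memset(&page, 0, sizeof(page));
    set_compound_order_sketch(&page, 10);

    set_compound_order_sketch(&page, 0); /* leaves compound_nr == 1 */
    if (apply_fix)
        page.compound_nr = 0;            /* explicit clear, as in the fix */

    return bad_page_check(&page);
}

void compound_nr_demo(void)
{
    assert(bad_after_destroy(false));
    assert(!bad_after_destroy(true));
}
```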
Kuan-Ying Lee [Fri, 11 Dec 2020 21:36:49 +0000 (13:36 -0800)]
kasan: fix object remaining in offline per-cpu quarantine
We hit this issue in our internal test. When enabling generic kasan, a
kfree()'d object is put into per-cpu quarantine first. If the cpu goes
offline, the object still remains in the per-cpu quarantine. If we call
kmem_cache_destroy() now, slub will report an "Objects remaining" error.
=============================================================================
BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
-----------------------------------------------------------------------------
Register a cpu hotplug function to remove all objects in the offline
per-cpu quarantine when cpu is going offline. Set a per-cpu variable to
indicate this cpu is offline.
[qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com
Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
Signed-off-by: Kuan-Ying Lee <Kuan-Ying.Lee@mediatek.com>
Signed-off-by: Zqiang <qiang.zhang@windriver.com>
Suggested-by: Dmitry Vyukov <dvyukov@google.com>
Reported-by: Guangye Yang <guangye.yang@mediatek.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Matthias Brugger <matthias.bgg@gmail.com>
Cc: Nicholas Tang <nicholas.tang@mediatek.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: Qian Cai <qcai@redhat.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Arnd Bergmann [Fri, 11 Dec 2020 21:36:46 +0000 (13:36 -0800)]
elfcore: fix building with clang
kernel/elfcore.c only contains weak symbols, which triggers a bug with
clang in combination with recordmcount:
Cannot find symbol for section 2: .text.
kernel/elfcore.o: failed
Move the empty stubs into linux/elfcore.h as inline functions. As only
two architectures use these, just use the architecture specific Kconfig
symbols to key off the declaration.
Arnd Bergmann [Fri, 11 Dec 2020 21:36:42 +0000 (13:36 -0800)]
initramfs: fix clang build failure
There is only one function in init/initramfs.c that is in the .text
section, and it is marked __weak. When building with clang-12 and the
integrated assembler, this leads to a bug with recordmcount:
./scripts/recordmcount "init/initramfs.o"
Cannot find symbol for section 2: .text.
init/initramfs.o: failed
I'm not quite sure what exactly goes wrong, but I notice that this
function is only ever called from an __init function, and normally
inlined. Marking it __init as well is clearly correct and it leads to
recordmcount no longer complaining.
Arnd Bergmann [Fri, 11 Dec 2020 21:36:38 +0000 (13:36 -0800)]
kbuild: avoid static_assert for genksyms
genksyms does not know or care about the _Static_assert() built-in, and
sometimes falls back to ignoring the later symbols, which causes
undefined behavior such as
WARNING: modpost: EXPORT symbol "ethtool_set_ethtool_phy_ops" [vmlinux] version generation failed, symbol will not be versioned.
ld: net/ethtool/common.o: relocation R_AARCH64_ABS32 against `__crc_ethtool_set_ethtool_phy_ops' can not be used when making a shared object
net/ethtool/common.o:(_ftrace_annotated_branch+0x0): dangerous relocation: unsupported relocation
Redefine static_assert for genksyms to avoid that.
Link: https://lkml.kernel.org/r/20201203230955.1482058-1-arnd@kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Suggested-by: Ard Biesheuvel <ardb@kernel.org>
Cc: Masahiro Yamada <masahiroy@kernel.org>
Cc: Michal Marek <michal.lkml@markovi.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Rikard Falkeborn <rikard.falkeborn@gmail.com>
Cc: Marco Elver <elver@google.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Miles Chen [Fri, 11 Dec 2020 21:36:31 +0000 (13:36 -0800)]
proc: use untagged_addr() for pagemap_read addresses
When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().
I tested with 5.10-rc4 and the issue remains.
Explanation from Catalin in [1]:
"Arguably, that's a user-space bug since tagged file offsets were never
supported. In this case it's not even a tag at bit 56 as per the arm64
tagged address ABI but rather down to bit 47. You could say that the
problem is caused by the C library (malloc()) or whoever created the
tagged vaddr and passed it to this function. It's not a kernel
regression as we've never supported it.
Now, pagemap is a special case where the offset is usually not
generated as a classic file offset but rather derived by shifting a
user virtual address. I guess we can make a concession for pagemap
(only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"
My test code is based on [2]:
A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8
userspace program:
uint64 OsLayer::VirtualToPhysical(void *vaddr) {
uint64 frame, paddr, pfnmask, pagemask;
int pagesize = sysconf(_SC_PAGESIZE);
off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
int fd = open(kPagemapPath, O_RDONLY);
...
if (lseek64(fd, off, SEEK_SET) != off || read(fd, &frame, 8) != 8) {
int err = errno;
string errtxt = ErrorString(err);
if (fd >= 0)
close(fd);
return 0;
}
...
}
kernel side, in pagemap_read():
/* watch out for wraparound */
// svpfn == 0xb400007662f54
// (mm->task_size >> PAGE_SHIFT) == 0x8000000
if (svpfn > mm->task_size >> PAGE_SHIFT) // the condition is true because of the tag 0xb4
start_vaddr = end_vaddr;
ret = 0;
while (count && (start_vaddr < end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
int len;
unsigned long end;
...
}
...
}
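The offset arithmetic in the excerpts above can be replayed with a short sketch. It assumes 4 KiB pages and uses a simplified stand-in for untagged_addr() that just clears the top byte of a user address; the function and macro names are hypothetical, and the tagged pointer value is the one quoted in this commit message.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT_SKETCH     12 /* assumes 4 KiB pages */
#define PM_ENTRY_BYTES_SKETCH 8  /* one 64-bit pagemap entry per page */

/* Simplified stand-in for untagged_addr(): clear the top byte. */
uint64_t untag_user_addr(uint64_t addr)
{
    return addr & ((UINT64_C(1) << 56) - 1);
}

/* File offset a userspace reader derives from a virtual address. */
uint64_t pagemap_offset(uint64_t vaddr)
{
    return (vaddr >> PAGE_SHIFT_SKETCH) * PM_ENTRY_BYTES_SKETCH;
}

/* Start address recovered from the file offset, with the tag removed. */
uint64_t start_vaddr_from_offset(uint64_t off)
{
    uint64_t svpfn = off / PM_ENTRY_BYTES_SKETCH;

    return untag_user_addr(svpfn << PAGE_SHIFT_SKETCH);
}

void pagemap_demo(void)
{
    uint64_t off = pagemap_offset(UINT64_C(0xb400007662f541c8));

    /* The tag 0xb4 survives in the offset, shifted down by PAGE_SHIFT - 3. */
    assert(off == UINT64_C(0x5a00003b317aa0));
    /* After untagging, the page-aligned address no longer carries it. */
    assert(start_vaddr_from_offset(off) == UINT64_C(0x7662f54000));
}
```

With the tag removed, the page frame number 0x7662f54 is below the 0x8000000 limit quoted in the kernel excerpt, so the wraparound check no longer rejects the request.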
Andrew Morton [Fri, 11 Dec 2020 21:36:27 +0000 (13:36 -0800)]
revert "mm/filemap: add static for function __add_to_page_cache_locked"
Revert commit 3351b16af494 ("mm/filemap: add static for function
__add_to_page_cache_locked") due to incompatibility with
ALLOW_ERROR_INJECTION which results in build errors.
Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
Tested-by: Justin Forbes <jmforbes@linuxtx.org>
Tested-by: Greg Thelen <gthelen@google.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Michal Kubecek <mkubecek@suse.cz>
Cc: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Tony Luck <tony.luck@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dmitry Torokhov [Thu, 10 Dec 2020 04:13:24 +0000 (20:13 -0800)]
Input: cm109 - do not stomp on control URB
We need to make sure we are not stomping on the control URB that was
issued when opening the device when attempting to toggle the buzzer.
To do that we need to mark it as pending in cm109_open().
Miquel Raynal [Thu, 3 Dec 2020 19:03:40 +0000 (20:03 +0100)]
mtd: rawnand: xway: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
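The "default, not forced" behaviour described above amounts to applying Hamming only when no algorithm has been selected yet. A minimal sketch under that assumption follows; the enum, struct and function names are hypothetical and do not reproduce the raw NAND core's types.

```c
#include <assert.h>

/* Hypothetical algorithm identifiers; UNKNOWN means "nothing selected". */
enum ecc_algo_sketch {
    ECC_ALGO_UNKNOWN = 0,
    ECC_ALGO_HAMMING,
    ECC_ALGO_BCH,
};

struct ecc_conf_sketch {
    enum ecc_algo_sketch algo;
};

/* Fall back to Hamming only if nothing (e.g. a DT property) chose one. */
void ecc_apply_default_algo(struct ecc_conf_sketch *conf)
{
    if (conf->algo == ECC_ALGO_UNKNOWN)
        conf->algo = ECC_ALGO_HAMMING;
}

/* Convenience wrapper: the algorithm in effect for a given DT selection. */
enum ecc_algo_sketch resolve_algo(enum ecc_algo_sketch from_dt)
{
    struct ecc_conf_sketch conf = { .algo = from_dt };

    ecc_apply_default_algo(&conf);
    return conf.algo;
}

void ecc_default_demo(void)
{
    assert(resolve_algo(ECC_ALGO_UNKNOWN) == ECC_ALGO_HAMMING);
    assert(resolve_algo(ECC_ALGO_BCH) == ECC_ALGO_BCH);
}
```

Forcing the algorithm unconditionally would be the broken variant: a selection coming from the device tree would be overwritten.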
Miquel Raynal [Thu, 3 Dec 2020 19:03:39 +0000 (20:03 +0100)]
mtd: rawnand: socrates: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Miquel Raynal [Thu, 3 Dec 2020 19:03:38 +0000 (20:03 +0100)]
mtd: rawnand: plat_nand: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Miquel Raynal [Thu, 3 Dec 2020 19:03:37 +0000 (20:03 +0100)]
mtd: rawnand: pasemi: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Miquel Raynal [Thu, 3 Dec 2020 19:03:36 +0000 (20:03 +0100)]
mtd: rawnand: orion: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Reported-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Fixes: 553508cec2e8 ("mtd: rawnand: orion: Move the ECC initialization to ->attach_chip()")
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
Tested-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Link: https://lore.kernel.org/linux-mtd/20201203190340.15522-6-miquel.raynal@bootlin.com
Miquel Raynal [Thu, 3 Dec 2020 19:03:35 +0000 (20:03 +0100)]
mtd: rawnand: mpc5121: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Miquel Raynal [Thu, 3 Dec 2020 19:03:34 +0000 (20:03 +0100)]
mtd: rawnand: gpio: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Miquel Raynal [Thu, 3 Dec 2020 19:03:33 +0000 (20:03 +0100)]
mtd: rawnand: au1550: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Miquel Raynal [Thu, 3 Dec 2020 19:03:32 +0000 (20:03 +0100)]
mtd: rawnand: ams-delta: Do not force a particular software ECC engine
Originally, commit d7157ff49a5b ("mtd: rawnand: Use the ECC framework
user input parsing bits") kind of broke the logic around the
initialization of several ECC engines.
Unfortunately, the fix (which indeed moved the ECC initialization to
the right place) did not take into account the fact that a different
ECC algorithm could have been used thanks to a DT property,
considering the "Hamming" algorithm entry a configuration while it was
only a default.
Add the necessary logic to be sure Hamming keeps being only a default.
Linus Torvalds [Fri, 11 Dec 2020 18:25:04 +0000 (10:25 -0800)]
Merge tag 'pinctrl-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
Pull pin control fixes from Linus Walleij:
"Here is a late set of pin control fixes for v5.10, most concern some
minor and major issues found in the Intel drivers. Some are so hairy
that I have no idea what is going on there, but luckily the maintainer
knows what's up.
We also have an interesting fix for AMD, which makes AMD-based laptops
more stable IIUC.
Summary:
- Fix up some SPI group and a register offset on Intel Jasperlake
- Set default bias on Intel Merrifield
- Preserve debouncing on Intel Baytrail
- Stop .set_type() irqchip callback in the AMD driver from fiddling
with the debounce filter
- Fix access to GPIO banks that are pass-thru on the Aspeed
- Fix a fix for the Intel pin control driver to disable Rx/Tx when
requesting a UART line as GPIO"
* tag 'pinctrl-v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
pinctrl: intel: Actually disable Tx and Rx buffers on GPIO request
pinctrl: aspeed: Fix GPIO requests on pass-through banks
pinctrl: amd: remove debounce filter setting in IRQ type setting
pinctrl: baytrail: Avoid clearing debounce value when turning it off
pinctrl: merrifield: Set default bias in case no particular value given
pinctrl: jasperlake: Fix HOSTSW_OWN offset
pinctrl: jasperlake: Unhide SPI group of pins
Linus Torvalds [Fri, 11 Dec 2020 18:22:17 +0000 (10:22 -0800)]
Merge tag 'v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio
Pull GPIO fixes from Linus Walleij:
"These are hopefully the last GPIO fixes for this cycle.
All are driver fixes except a small resource leak for pin ranges in
the gpiolib. Two are PM related, which is nice because when developers
start to find PM bugs it is usually because they have smoked out the
bugs of more severe nature.
Summary:
- Fix runtime PM balancing on the errorpath of the Arizona driver
- Fix a suspend NULL pointer reference in the dwapb driver
- Balance free:ing in gpiochip_generic_free()
- Fix runtime PM balancing on the errorpath of the zynq driver
- Fix irqdomain use-after-free in the mvebu driver
- Break an eternal loop in the spreadtrum EIC driver"
* tag 'v5.10-3' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-gpio:
gpio: eic-sprd: break loop when getting NULL device resource
gpio: mvebu: fix potential user-after-free on probe
gpio: zynq: fix reference leak in zynq_gpio functions
gpiolib: Don't free if pin ranges are not defined
gpio: dwapb: fix NULL pointer dereference at dwapb_gpio_suspend()
gpio: arizona: disable pm_runtime in case of failure
Linus Torvalds [Fri, 11 Dec 2020 18:10:13 +0000 (10:10 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"Two small clk driver build fixes
- Remove __packed from a Renesas struct to improve portability
- Fix a linking problem with i.MX when config options don't agree"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: renesas: r9a06g032: Drop __packed for portability
clk: imx: scu: fix MXC_CLK_SCU module build break
Dexuan reported a regression where StorVSC fails to probe a device (and
where, consequently, the VM may fail to boot). The root-cause analysis led
to a long-standing race condition that is exposed by the validation commit
in question. Let's put the new validation aside until a proper solution
for that race condition is in place.
Link: https://lore.kernel.org/r/20201211131404.21359-1-parri.andrea@gmail.com
Fixes: 3b8c72d076c4 ("scsi: storvsc: Validate length of incoming packet in storvsc_on_channel_callback()")
Cc: Dexuan Cui <decui@microsoft.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Signed-off-by: Andrea Parri (Microsoft) <parri.andrea@gmail.com>
Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com>
Linus Torvalds [Fri, 11 Dec 2020 01:52:13 +0000 (17:52 -0800)]
Merge tag 'drm-fixes-2020-12-11' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"Last week of fixes, just amdgpu and i915 collections. We had an i915
regression reported by HJ Lu this morning, and this contains a fix for
it that he has tested.
There are a fair few other fixes, but they are spread across the two
drivers, and all fairly self contained.
amdgpu:
- Fan fix for CI asics
- Fix a warning in possible_crtcs
- Build fix for when debugfs is disabled
- Display overflow fix
- Display watermark fixes for Renoir
- SDMA 5.2 fix
- Stolen vga memory regression fix
- Power profile fixes
- Fix a regression from removal of GEM and PRIME callbacks
* tag 'drm-fixes-2020-12-11' of git://anongit.freedesktop.org/drm/drm:
drm/i915/display: Go softly softly on initial modeset failure
drm/amd/pm: typo fix (CUSTOM -> COMPUTE)
drm/amdgpu: Initialise drm_gem_object_funcs for imported BOs
drm/amdgpu: fix size calculation with stolen vga memory
drm/amd/pm: update smu10.h WORKLOAD_PPLIB setting for raven
drm/amdkfd: Fix leak in dmabuf import
drm/amdgpu: fix sdma instance fw version and feature version init
drm/amd/display: Add wm table for Renoir
drm/amd/display: Prevent bandwidth overflow
drm/amdgpu: fix debugfs creation/removal, again
drm/amdgpu/disply: set num_crtc earlier
drm/amdgpu/powerplay: parse fan table for CI asics
drm/i915/gt: Declare gen9 has 64 mocs entries!
drm/i915/display/dp: Compute the correct slice count for VDSC on DP
drm/i915: fix size_t greater or equal to zero comparison
drm/i915/gt: Cancel the preemption timeout on responding to it
drm/i915/gt: Ignore repeated attempts to suspend request flow across reset
drm/i915/gem: Propagate error from cancelled submit due to context closure
drm/i915/gem: Check the correct variable in selftest
Palmer Dabbelt [Wed, 25 Nov 2020 19:57:03 +0000 (11:57 -0800)]
RISC-V: Define get_cycles64() regardless of M-mode
The timer driver uses get_cycles64() unconditionally to obtain the current
time. A recent refactoring lost the common definition for some configs, which
is now the only one we need.
Fixes: d5be89a8d118 ("RISC-V: Resurrect the MMIO timer implementation for M-mode systems")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
Linus Torvalds [Fri, 11 Dec 2020 00:36:30 +0000 (16:36 -0800)]
Merge tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fix from Michael Ellerman:
"One commit to implement copy_from_kernel_nofault_allowed(), otherwise
copy_from_kernel_nofault() can trigger warnings when accessing bad
addresses in some configurations.
Thanks to Christophe Leroy and Qian Cai"
* tag 'powerpc-5.10-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/mm: Fix KUAP warning by providing copy_from_kernel_nofault_allowed()
Chris Wilson [Thu, 10 Dec 2020 23:07:41 +0000 (23:07 +0000)]
drm/i915/display: Go softly softly on initial modeset failure
Reduce the module/device probe error into a mere debug to hide issues
where the initial modeset is failing (after lies told by hw probe) and
the system hangs with a livelock in cleaning up the failed commit.
Reported-by: H.J. Lu <hjl.tools@gmail.com>
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210619
Fixes: b3bf99daaee9 ("drm/i915/display: Defer initial modeset until after GGTT is initialised")
Fixes: ccc9e67ab26f ("drm/i915/display: Defer initial modeset until after GGTT is initialised")
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: "Ville Syrjälä" <ville.syrjala@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: H.J. Lu <hjl.tools@gmail.com>
Cc: Dave Airlie <airlied@redhat.com>
Reviewed-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
Signed-off-by: Dave Airlie <airlied@redhat.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20201210230741.17140-1-chris@chris-wilson.co.uk
Dave Airlie [Thu, 10 Dec 2020 23:47:38 +0000 (09:47 +1000)]
Merge tag 'drm-intel-fixes-2020-12-09' of git://anongit.freedesktop.org/drm/drm-intel into drm-fixes
Fixes for VDSC/DP, selftests, shmem_utils, preemption, submission, and gt reset:
- Check the correct variable in selftest (Dan)
- Propagate error from canceled submit due to context closure (Chris)
- Ignore repeated attempts to suspend request flow across reset (Chris)
- Cancel the preemption timeout on responding to it (Chris)
- Fix unsigned compared against 0 (Colin)
- Compute the correct slice count for VDSC on DP (Manasi)
- Declare gen9 has 64 mocs entries (Chris)
Dave Airlie [Thu, 10 Dec 2020 23:42:14 +0000 (09:42 +1000)]
Merge tag 'amd-drm-fixes-5.10-2020-12-09' of git://people.freedesktop.org/~agd5f/linux into drm-fixes
amd-drm-fixes-5.10-2020-12-09:
amdgpu:
- Fan fix for CI asics
- Fix a warning in possible_crtcs
- Build fix for when debugfs is disabled
- Display overflow fix
- Display watermark fixes for Renoir
- SDMA 5.2 fix
- Stolen vga memory regression fix
- Power profile fixes
- Fix a regression from removal of GEM and PRIME callbacks
Linus Torvalds [Thu, 10 Dec 2020 23:36:09 +0000 (15:36 -0800)]
Merge tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs
Pull NFS client fixes from Anna Schumaker:
"Here are a handful more bugfixes for 5.10.
Unfortunately, we found some problems with the new READ_PLUS operation
that aren't easy to fix. We've decided to disable this codepath
through a Kconfig option for now, but a series of patches going into
5.11 will clean up the code and fix the issues at the same time. This
seemed like the best way to go about it.
Summary:
- Fix array overflow when flexfiles mirroring is enabled
- Fix rpcrdma_inline_fixup() crash with new LISTXATTRS
- Fix 5 second delay when doing inter-server copy
- Disable READ_PLUS by default"
* tag 'nfs-for-5.10-3' of git://git.linux-nfs.org/projects/anna/linux-nfs:
NFS: Disable READ_PLUS by default
NFSv4.2: Fix 5 seconds delay when doing inter server copy
NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
pNFS/flexfiles: Fix array overflow when flexfiles mirroring is enabled
2) Fix memory leak in xfrm_user_policy(). Fix from Yu Kuai.
3) Fix polling in xsk sockets by using sk_poll_wait() instead of
datagram_poll() which keys off of sk_wmem_alloc and such which xsk
sockets do not update. From Xuan Zhuo.
4) Missing init of rekey_data in cfg80211, from Sara Sharon.
5) Fix destroy of timer before init, from Davide Caratti.
6) Missing CRYPTO_CRC32 selects in ethernet driver Kconfigs, from Arnd
Bergmann.
7) Missing error return in rtm_to_fib_config() switch case, from Zhang
Changzhong.
8) Fix some src/dest address handling in vrf and add a testcase. From
Stephen Suryaputra.
9) Fix multicast handling in Seville switches driven by mscc-ocelot
driver. From Vladimir Oltean.
10) Fix proto value passed to skb delivery demux in udp, from Xin Long.
11) HW pkt counters not reported correctly in enetc driver, from Claudiu
Manoil.
12) Fix deadlock in bridge, from Joseph Huang.
13) Missing of_node_put() in dpaa2 driver, from Christophe JAILLET.
14) Fix pid fetching in bpftool when there are a lot of results, from
Andrii Nakryiko.
15) Fix long timeouts in nft_dynset, from Pablo Neira Ayuso.
16) Various stmmac fixes, from Fugang Duan.
17) Fix null deref in tipc, from Cengiz Can.
18) When mss is big, choose more reasonable rcvq_space in tcp, from Eric
Dumazet.
19) Revert a geneve change that likely isn't necessary, from Jakub
Kicinski.
20) Avoid premature rx buffer reuse in various Intel drivers, from Björn
Töpel.
21) Retain ECT bits during TOS reflection in tcp, from Wei Wang.
22) Fix TSO deferral wrt. cwnd limiting in tcp, from Neal Cardwell.
23) MPLS_OPT_LSE_LABEL attribute is 32 not 8 bits, from Guillaume Nault.
24) Fix propagation of 32-bit signed bounds in bpf verifier and add test
cases, from Alexei Starovoitov.
* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (81 commits)
selftests: fix poll error in udpgro.sh
selftests/bpf: Fix "dubious pointer arithmetic" test
selftests/bpf: Fix array access with signed variable test
selftests/bpf: Add test for signed 32-bit bound check bug
bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
MAINTAINERS: Add entry for Marvell Prestera Ethernet Switch driver
net: sched: Fix dump of MPLS_OPT_LSE_LABEL attribute in cls_flower
net/mlx4_en: Handle TX error CQE
net/mlx4_en: Avoid scheduling restart task if it is already running
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
net: flow_offload: Fix memory leak for indirect flow block
tcp: Retain ECT bits for tos reflection
ethtool: fix stack overflow in ethnl_parse_bitset()
e1000e: fix S0ix flow to allow S0i3.2 subset entry
ice: avoid premature Rx buffer reuse
ixgbe: avoid premature Rx buffer reuse
i40e: avoid premature Rx buffer reuse
igb: avoid transmit queue timeout in xdp path
igb: use xdp_do_flush
igb: skb add metasize for xdp
...
Thomas Gleixner [Thu, 10 Dec 2020 20:18:22 +0000 (21:18 +0100)]
x86/apic/vector: Fix ordering in vector assignment
Prarit reported that depending on the affinity setting the
' irq $N: Affinity broken due to vector space exhaustion.'
message is showing up in dmesg, but the vector space on the CPUs in the
affinity mask is definitely not exhausted.
Shung-Hsi provided traces and analysis which pinpoints the problem:
The ordering of trying to assign an interrupt vector in
assign_irq_vector_any_locked() is simply wrong if the interrupt data has a
valid node assigned. It does:
1) Try the intersection of affinity mask and node mask
2) Try the node mask
3) Try the full affinity mask
4) Try the full online mask
Obviously #2 and #3 are in the wrong order as the requested affinity
mask has to take precedence.
In the observed cases #1 failed because the affinity mask did not contain
CPUs from node 0. That made it allocate a vector from node 0, thereby
breaking affinity and emitting the misleading message.
Reverse the order of #2 and #3 so the full affinity mask without the node
intersection is tried before affinity is actually broken.
If no node is assigned then only the full affinity mask is tried and, if
that fails, the full online mask.
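The corrected search order can be sketched as below. This is an illustrative Python model under stated assumptions, not the kernel code: CPU masks are plain sets, and `candidate_masks`, `assign_vector` and the `try_alloc` callback are hypothetical names standing in for the per-mask allocation attempt.

```python
from typing import Callable, List, Optional, Set, Tuple

Mask = Set[int]


def candidate_masks(affinity: Mask, online: Mask,
                    node: Optional[Mask]) -> List[Tuple[str, Mask]]:
    """Return the masks to try, in the corrected order (hypothetical model)."""
    order: List[Tuple[str, Mask]] = []
    if node is not None:
        # 1) Prefer CPUs that are both requested and node local.
        order.append(("affinity & node", affinity & node))
    # 2) The full requested affinity mask takes precedence over the node.
    order.append(("affinity", affinity))
    if node is not None:
        # 3) Only now fall back to the node mask, which breaks affinity.
        order.append(("node", node))
    # 4) Last resort: any online CPU.
    order.append(("online", online))
    return order


def assign_vector(affinity: Mask, online: Mask, node: Optional[Mask],
                  try_alloc: Callable[[Mask], Optional[int]]):
    """Walk the candidates and return (label, cpu) of the first success."""
    for label, mask in candidate_masks(affinity, online, node):
        cpu = try_alloc(mask)
        if cpu is not None:
            return label, cpu
    return None
```

With an affinity mask of CPUs 4-5 and a node mask of CPUs 0-3, the intersection is empty, so the second attempt picks a CPU from the requested mask instead of falling through to the node.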
Anna Schumaker [Thu, 3 Dec 2020 20:18:39 +0000 (15:18 -0500)]
NFS: Disable READ_PLUS by default
We've been seeing failures with xfstests generic/091 and generic/263
when using READ_PLUS. I've made some progress on these issues, and the
tests fail later on but still don't pass. Let's disable READ_PLUS by
default until we can work out what is going on.
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Dai Ngo [Tue, 24 Nov 2020 03:15:17 +0000 (22:15 -0500)]
NFSv4.2: Fix 5 seconds delay when doing inter server copy
Since commit b4868b44c5628 ("NFSv4: Wait for stateid updates after
CLOSE/OPEN_DOWNGRADE"), every inter server copy operation suffers 5
seconds delay regardless of the size of the copy. The delay is from
nfs_set_open_stateid_locked when the check by nfs_stateid_is_sequential
fails because the seqid in both nfs4_state and nfs4_stateid are 0.
Fix __nfs42_ssc_open to delay setting of NFS_OPEN_STATE in nfs4_state,
until after the call to update_open_stateid, to indicate this is the 1st
open. This fix is one of 2 patches; the other patch is the fix in the
source server to return the stateid for COPY_NOTIFY request with seqid 1
instead of 0.
Fixes: ce0887ac96d3 ("NFSD add nfs4 inter ssc to nfsd4_copy")
Signed-off-by: Dai Ngo <dai.ngo@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Chuck Lever [Wed, 25 Nov 2020 00:15:18 +0000 (19:15 -0500)]
NFS: Fix rpcrdma_inline_fixup() crash with new LISTXATTRS operation
By switching to an XFS-backed export, I am able to reproduce the
ibcomp worker crash on my client with xfstests generic/013.
For the failing LISTXATTRS operation, xdr_inline_pages() is called
with page_len=12 and buflen=128.
- When ->send_request() is called, rpcrdma_marshal_req() does not
set up a Reply chunk because buflen is smaller than the inline
threshold. Thus rpcrdma_convert_iovs() does not get invoked at
all and the transport's XDRBUF_SPARSE_PAGES logic is not invoked
on the receive buffer.
- During reply processing, rpcrdma_inline_fixup() tries to copy
received data into rq_rcv_buf->pages because page_len is positive.
But there are no receive pages because rpcrdma_marshal_req() never
allocated them.
The result is that the ibcomp worker faults and dies. Sometimes that
causes a visible crash, and sometimes it results in a transport hang
without other symptoms.
RPC/RDMA's XDRBUF_SPARSE_PAGES support is not entirely correct, and
should eventually be fixed or replaced. However, my preference is
that upper-layer operations should explicitly allocate their receive
buffers (using GFP_KERNEL) when possible, rather than relying on
XDRBUF_SPARSE_PAGES.
Reported-by: Olga kornievskaia <kolga@netapp.com>
Suggested-by: Olga kornievskaia <kolga@netapp.com>
Fixes: c10a75145feb ("NFSv4.2: add the extended attribute proc functions.")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Olga kornievskaia <kolga@netapp.com>
Reviewed-by: Frank van der Linden <fllinden@amazon.com>
Tested-by: Olga kornievskaia <kolga@netapp.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Paolo Abeni [Wed, 9 Dec 2020 11:21:13 +0000 (12:21 +0100)]
selftests: fix poll error in udpgro.sh
The test program udpgso_bench_rx always invokes the poll()
syscall with a timeout of 10ms. If a larger timeout is specified
via the command line, udpgso_bench_rx is supposed to do multiple
poll() calls till the timeout is expired or an event is received.
Currently the poll() loop errors out after the first invocation with
no events, and may cause self-test failures like:
This change addresses the issue allowing the poll() loop to consume
all the configured timeout.
Fixes: ada641ff6ed3 ("selftests: fixes for UDP GRO")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
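The intended poll() loop behaviour from the udpgro.sh fix can be sketched as follows. This is a minimal Python model, not the udpgso_bench_rx source: `wait_readable` and its parameters are hypothetical names, and only the 10ms slice comes from the message.

```python
import os
import select
import time


def wait_readable(fd: int, timeout_ms: int, slice_ms: int = 10) -> bool:
    """Poll fd in slice_ms steps until it is readable or timeout_ms elapses."""
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    deadline = time.monotonic() + timeout_ms / 1000.0
    while True:
        if poller.poll(slice_ms):
            return True   # an event arrived
        if time.monotonic() >= deadline:
            return False  # the whole configured timeout was consumed
        # No events in this slice: keep polling instead of erroring out.
```

A poll() slice that returns no events is treated as "not yet", so the loop gives up only once the full timeout has elapsed.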
selftests/bpf: Fix "dubious pointer arithmetic" test
The verifier trace changed following a bugfix. After checking the 64-bit
sign, only the upper bit mask is known, not bit 31. Update the test
accordingly.
selftests/bpf: Add test for signed 32-bit bound check bug
After a 32-bit load followed by a branch, the verifier would reduce the
maximum bound of the register to 0x7fffffff, allowing a user to bypass
bound checks. Ensure such a program is rejected.
In the second test, the 64-bit compare should not be sufficient to
determine whether the signed 32-bit lower bound is 0, so the verifier
should reject the second branch.
bpf: Fix propagation of 32-bit signed bounds from 64-bit bounds.
The 64-bit signed bounds should not affect 32-bit signed bounds unless the
verifier knows that upper 32-bits are either all 1s or all 0s. For example the
register with smin_value==1 doesn't mean that s32_min_value is also equal to 1,
since smax_value could be larger than 32-bit subregister can hold.
The verifier refines the smax/s32_max return value from certain helpers in
do_refine_retval_range(). Teach the verifier to recognize that smin/s32_min
value is also bounded. When both smin and smax bounds fit into 32-bit
subregister the verifier can propagate those bounds.
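A small worked example of why a 64-bit signed minimum says nothing about the 32-bit subregister unless both bounds fit. This is an illustrative Python model, not verifier code; `as_s32` and `propagate_s32_bounds` are hypothetical helpers.

```python
S32_MIN, S32_MAX = -(1 << 31), (1 << 31) - 1


def as_s32(value: int) -> int:
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


def propagate_s32_bounds(smin: int, smax: int):
    """Derive (s32_min, s32_max) from 64-bit signed bounds.

    The bounds are propagated only when both fit into the 32-bit
    subregister; otherwise the full s32 range is returned.
    """
    if S32_MIN <= smin and smax <= S32_MAX:
        return smin, smax
    return S32_MIN, S32_MAX
```

For instance, the 64-bit value 0x80000000 satisfies smin == 1, yet `as_s32(0x80000000)` is negative, so s32_min cannot be assumed to be 1 in that case.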
Linus Torvalds [Thu, 10 Dec 2020 19:00:27 +0000 (11:00 -0800)]
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma
Pull rdma fixes from Jason Gunthorpe:
"Two user triggerable crashers and some EFA related regressions:
- Syzkaller found a bug in CM
- Restore access to the GID table and fix modify_qp for EFA
- Crasher in qedr"
* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
RDMA/cm: Fix an attempt to use non-valid pointer when cleaning timewait
RDMA/core: Fix empty gid table for non IB/RoCE devices
RDMA/efa: Use the correct current and new states in modify QP
RDMA/qedr: iWARP invalid(zero) doorbell address fix
Linus Torvalds [Thu, 10 Dec 2020 18:48:49 +0000 (10:48 -0800)]
Merge tag 'media/v5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media
Pull media fixes from Mauro Carvalho Chehab:
"A couple of fixes:
- videobuf2: fix a DMABUF bug, preventing it from properly handling cache
sync/flush
- vidtv: a use after free and a few sparse/smatch warning fixes
- pulse8-cec: a duplicate free and a bug related to new firmware
usage
- mtk-cir: fix a regression on a clock setting"
* tag 'media/v5.10-4' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media:
media: vidtv: fix some warnings
media: vidtv: fix kernel-doc markups
media: [next] media: vidtv: fix a read from an object after it has been freed
media: vb2: set cache sync hints when init buffers
media: pulse8-cec: add support for FW v10 and up
media: pulse8-cec: fix duplicate free at disconnect or probe error
media: mtk-cir: fix calculation of chk period
Xiaochen Shen [Fri, 4 Dec 2020 06:27:59 +0000 (14:27 +0800)]
x86/resctrl: Fix incorrect local bandwidth when mba_sc is enabled
The MBA software controller (mba_sc) is a feedback loop which
periodically reads MBM counters and tries to restrict the bandwidth
below a user-specified value. It tags along the MBM counter overflow
handler to do the updates with 1s interval in mbm_update() and
update_mba_bw().
The purpose of mbm_update() is to periodically read the MBM counters to
make sure that the hardware counter doesn't wrap around more than once
between user samplings. mbm_update() calls __mon_event_count() for local
bandwidth updating when mba_sc is not enabled, but calls mbm_bw_count()
instead when mba_sc is enabled. __mon_event_count() will not be called
for local bandwidth updating in MBM counter overflow handler, but it is
still called when reading MBM local bandwidth counter file
'mbm_local_bytes', the call path is as below:
In __mon_event_count(), m->chunks is updated by delta chunks which is
calculated from previous MSR value (m->prev_msr) and current MSR value.
When mba_sc is enabled, m->chunks is also updated in mbm_update() by
mistake by the delta chunks which is calculated from m->prev_bw_msr
instead of m->prev_msr. But m->chunks is not used in update_mba_bw() in
the mba_sc feedback loop.
When reading MBM local bandwidth counter file, m->chunks was changed
unexpectedly by mbm_bw_count(). As a result, an incorrect local
bandwidth counter, calculated from the incorrect m->chunks, is shown to
the user.
Fix this by removing incorrect m->chunks updating in mbm_bw_count() in
MBM counter overflow handler, and always calling __mon_event_count() in
mbm_update() to make sure that the hardware local bandwidth counter
doesn't wrap around.
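The double accounting described above can be illustrated with a toy model. This is a Python sketch under stated assumptions, not the resctrl code: the field names prev_msr, prev_bw_msr and chunks follow the message, while `prev_bw`, `bw_count`, `run` and the `touch_chunks` switch are hypothetical, and counter wraparound is ignored.

```python
from dataclasses import dataclass


@dataclass
class MbmState:
    prev_msr: int = 0      # last counter value seen by the event-count path
    prev_bw_msr: int = 0   # last counter value seen by the bandwidth path
    chunks: int = 0        # accumulated total reported to the user
    prev_bw: int = 0       # last per-interval delta, used by the feedback loop


def mon_event_count(m: MbmState, msr: int) -> None:
    """Event-count path: accumulate the delta since prev_msr."""
    m.chunks += msr - m.prev_msr
    m.prev_msr = msr


def bw_count(m: MbmState, msr: int, touch_chunks: bool) -> None:
    """Bandwidth path; touch_chunks=True reproduces the bug."""
    delta = msr - m.prev_bw_msr
    if touch_chunks:
        m.chunks += delta  # second update of chunks, based on prev_bw_msr
    m.prev_bw = delta
    m.prev_bw_msr = msr


def run(samples, fixed: bool) -> int:
    """Run the overflow handler for each sample, then one user read."""
    m = MbmState()
    for msr in samples:
        if fixed:
            mon_event_count(m, msr)              # always called
            bw_count(m, msr, touch_chunks=False)
        else:
            bw_count(m, msr, touch_chunks=True)  # called instead
    mon_event_count(m, samples[-1])              # user reads mbm_local_bytes
    return m.chunks
```

With counter samples 100 and 200, the buggy variant reports 400 chunks on the final read while the fixed variant reports 200, which matches the counter.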
Test steps:
# Run workload with aggressive memory bandwidth (e.g., 10 GB/s)
git clone https://github.com/intel/intel-cmt-cat && cd intel-cmt-cat && make
./tools/membw/membw -c 0 -b 10000 --read
# Enable MBA software controller
mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl
# Create control group c1
mkdir /sys/fs/resctrl/c1
# Set MB throttle to 6 GB/s
echo "MB:0=6000;1=6000" > /sys/fs/resctrl/c1/schemata
# Write PID of the workload to tasks file
echo `pidof membw` > /sys/fs/resctrl/c1/tasks
# Read local bytes counters twice with 1s interval, the calculated
# local bandwidth is not as expected (approaching to 6 GB/s):
local_1=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
sleep 1
local_2=`cat /sys/fs/resctrl/c1/mon_data/mon_L3_00/mbm_local_bytes`
echo "local b/w (bytes/s):" `expr $local_2 - $local_1`
Paolo Bonzini [Thu, 10 Dec 2020 16:34:24 +0000 (11:34 -0500)]
Merge tag 'kvmarm-fixes-5.10-5' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD
kvm/arm64 fixes for 5.10, take #5
- Don't leak page tables on PTE update
- Correctly invalidate TLBs on table to block transition
- Only update permissions if the fault level matches the
expected mapping size
Jens Axboe [Thu, 10 Dec 2020 14:08:22 +0000 (07:08 -0700)]
Merge branch 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md into block-5.10
Pull MD fixes from Song:
"This is to fix raid10 data corruption [1] in 5.10-rc7."
* 'md-fixes' of https://git.kernel.org/pub/scm/linux/kernel/git/song/md:
Revert "md: add md_submit_discard_bio() for submitting discard bio"
Revert "md/raid10: extend r10bio devs to raid disks"
Revert "md/raid10: pull codes that wait for blocked dev into one function"
Revert "md/raid10: improve raid10 discard request"
Revert "md/raid10: improve discard request for far layout"
Revert "dm raid: remove unnecessary discard limits for raid10"
Arvind Sankar [Wed, 11 Nov 2020 16:09:45 +0000 (11:09 -0500)]
x86/mm/mem_encrypt: Fix definition of PMD_FLAGS_DEC_WP
The PAT bit is in different locations for 4k and 2M/1G page table
entries.
Add a definition for _PAGE_LARGE_CACHE_MASK to represent the three
caching bits (PWT, PCD, PAT), similar to _PAGE_CACHE_MASK for 4k pages,
and use it in the definition of PMD_FLAGS_DEC_WP to get the correct PAT
index for write-protected pages.
Fixes: 6ebcb060713f ("x86/mm: Add support to encrypt the kernel in-place")
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Signed-off-by: Borislav Petkov <bp@suse.de>
Tested-by: Tom Lendacky <thomas.lendacky@amd.com>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/20201111160946.147341-1-nivedita@alum.mit.edu
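The difference between the two cache masks in the PMD_FLAGS_DEC_WP fix can be sketched with plain bit arithmetic. This Python model is illustrative only: the bit positions reflect my understanding of the x86 page table layout and should be treated as assumptions, `set_pat_index` is a hypothetical helper, and index 5 in the usage note is just an example value.

```python
# Bit positions as assumed for x86 page table entries.
PAGE_PWT = 1 << 3
PAGE_PCD = 1 << 4
PAGE_PAT = 1 << 7         # PAT bit in a 4k entry
PAGE_PSE = 1 << 7         # the same bit marks a large page in 2M/1G entries
PAGE_PAT_LARGE = 1 << 12  # PAT bit in a 2M/1G entry

PAGE_CACHE_MASK = PAGE_PWT | PAGE_PCD | PAGE_PAT
PAGE_LARGE_CACHE_MASK = PAGE_PWT | PAGE_PCD | PAGE_PAT_LARGE


def set_pat_index(flags: int, index: int, large: bool) -> int:
    """Return flags with the three caching bits encoding PAT index 0..7."""
    pat_bit = PAGE_PAT_LARGE if large else PAGE_PAT
    mask = PAGE_LARGE_CACHE_MASK if large else PAGE_CACHE_MASK
    bits = 0
    if index & 1:
        bits |= PAGE_PWT
    if index & 2:
        bits |= PAGE_PCD
    if index & 4:
        bits |= pat_bit
    return (flags & ~mask) | bits
```

Applying the 4k mask to a large entry touches bit 7, which is not the PAT bit there. For example, `set_pat_index(PAGE_PSE, 5, large=True)` keeps PAGE_PSE set and places the PAT bit at bit 12.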
Damien Le Moal [Wed, 9 Dec 2020 11:16:10 +0000 (20:16 +0900)]
zonefs: fix page reference and BIO leak
In zonefs_file_dio_append(), the pages obtained using
bio_iov_iter_get_pages() are not released on completion of the
REQ_OP_APPEND BIO, nor when bio_iov_iter_get_pages() fails.
Furthermore, a call to bio_put() is missing when
bio_iov_iter_get_pages() fails.
Fix these resource leaks by adding BIO resource release code (bio_put()
and bio_release_pages()) at the end of the function after the BIO
execution and add a jump to this resource cleanup code in case of
bio_iov_iter_get_pages() failure.
While at it, also fix the call to task_io_account_write() to be passed
the correct BIO size instead of bio_iov_iter_get_pages() return value.
Reported-by: Christoph Hellwig <hch@lst.de>
Fixes: 02ef12a663c7 ("zonefs: use REQ_OP_ZONE_APPEND for sync DIO")
Cc: stable@vger.kernel.org
Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Chris Chiu [Thu, 10 Dec 2020 04:24:47 +0000 (20:24 -0800)]
Input: i8042 - add Acer laptops to the i8042 reset list
The touchpad operates in Basic Mode by default in the Acer BIOS
setup, but some Aspire/TravelMate models require the i8042 to be
reset in order to be correctly detected.
Matthew Ruffell reported data corruption in raid10 due to the changes
in discard handling [1]. Revert these changes before we find a proper fix.
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Xiao Ni <xni@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Matthew Ruffell reported data corruption in raid10 due to the changes
in discard handling [1]. Revert these changes before we find a proper fix.
[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262/
Cc: Matthew Ruffell <matthew.ruffell@canonical.com>
Cc: Xiao Ni <xni@redhat.com>
Cc: Mike Snitzer <snitzer@redhat.com>
Acked-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
Fixes: 61aec25a6db5 ("cls_flower: Support filtering on multiple MPLS Label Stack Entries")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
David S. Miller [Thu, 10 Dec 2020 02:48:29 +0000 (18:48 -0800)]
Merge branch '1GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
Tony Nguyen says:
====================
Intel Wired LAN Driver Updates 2020-12-09
This series contains updates to igb, ixgbe, i40e, and ice drivers.
Sven Auhagen fixes issues with igb XDP: return correct error value in XDP
xmit back, increase header padding to include space for double VLAN, add
an extack error when Rx buffer is too small for frame size, set metasize if
it is set in xdp, change xdp_do_flush_map to xdp_do_flush, and update
trans_start to avoid possible Tx timeout.
Björn fixes an issue where an Rx buffer can be reused prematurely with
XDP redirect for ixgbe, i40e, and ice drivers.
The following are changes since commit 323a391a220c4a234cb1e678689d7f4c3b73f863:
can: isotp: isotp_setsockopt(): block setsockopt on bound sockets
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue 1GbE
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Dmitry Torokhov [Thu, 10 Dec 2020 01:59:53 +0000 (17:59 -0800)]
Input: cros_ec_keyb - send 'scancodes' in addition to key events
To let userspace know what 'scancodes' should be used in EVIOCGKEYCODE
and EVIOCSKEYCODE ioctls, we should send EV_MSC/MSC_SCAN events in
addition to EV_KEY/KEY_* events. The driver already declared MSC_SCAN
capability, so it is only a matter of actually sending the events.
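The resulting event stream can be sketched as below. This is an illustrative Python model, not the driver: the event constants are the values from the Linux input ABI, the `(row << row_shift) + col` scancode scheme and the single sync per scan are my assumptions about how a matrix keyboard driver reports keys, and `key_events` is a hypothetical helper.

```python
# Event type and code values from the Linux input event ABI.
EV_SYN, EV_KEY, EV_MSC = 0x00, 0x01, 0x04
SYN_REPORT, MSC_SCAN = 0x00, 0x04


def matrix_scan_code(row: int, col: int, row_shift: int) -> int:
    """Scancode for a matrix key, assuming the (row << row_shift) + col scheme."""
    return (row << row_shift) + col


def key_events(changes, row_shift: int):
    """Events for one scan: per key a scancode then a key event, one sync at the end.

    changes is a list of (row, col, keycode, pressed) tuples.
    """
    events = []
    for row, col, keycode, pressed in changes:
        events.append((EV_MSC, MSC_SCAN, matrix_scan_code(row, col, row_shift)))
        events.append((EV_KEY, keycode, 1 if pressed else 0))
    events.append((EV_SYN, SYN_REPORT, 0))
    return events
```

Userspace can then read the MSC_SCAN value that precedes a key event and pass it to EVIOCGKEYCODE or EVIOCSKEYCODE.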
David S. Miller [Thu, 10 Dec 2020 00:44:35 +0000 (16:44 -0800)]
Merge branch 'mlx4_en-fixes'
Tariq Toukan says:
====================
mlx4_en fixes
This patchset by Moshe contains fixes to the mlx4 Eth driver,
addressing issues in restart flow.
Patch 1 protects the restart task from being rescheduled while active.
Please queue for -stable >= v2.6.
Patch 2 reconstructs SQs stuck in error state, and adds prints for improved
debuggability.
Please queue for -stable >= v3.12.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 9 Dec 2020 13:03:39 +0000 (15:03 +0200)]
net/mlx4_en: Handle TX error CQE
In case error CQE was found while polling TX CQ, the QP is in error
state and all posted WQEs will generate error CQEs without any data
transmitted. Fix it by reopening the channels, via same method used for
TX timeout handling.
In addition add some more info on error CQE and WQE for debug.
Fixes: bd2f631d7c60 ("net/mlx4_en: Notify user when TX ring in error state")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Moshe Shemesh [Wed, 9 Dec 2020 13:03:38 +0000 (15:03 +0200)]
net/mlx4_en: Avoid scheduling restart task if it is already running
Add restarting state flag to avoid scheduling another restart task while
such task is already running. Change task name from watchdog_task to
restart_task to better fit the task role.
Fixes: 1e338db56e5a ("mlx4_en: Fix a race at restart task")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
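The idea of a restarting state flag can be sketched as follows. This is a hypothetical Python model, not the mlx4_en code: `RestartGuard` and its methods are invented for illustration, and a lock stands in for whatever atomic primitive the driver actually uses.

```python
import threading


class RestartGuard:
    """Hypothetical model of a 'restarting' state flag."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._restarting = False
        self.scheduled = 0  # number of restart tasks actually queued

    def request_restart(self) -> bool:
        """Queue a restart unless one is already pending or running."""
        with self._lock:
            if self._restarting:
                return False
            self._restarting = True
            self.scheduled += 1
            return True

    def restart_done(self) -> None:
        """Called by the restart task when it finishes."""
        with self._lock:
            self._restarting = False
```

A second request made while the flag is set is dropped, so at most one restart task is in flight at any time.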
Neal Cardwell [Wed, 9 Dec 2020 03:57:59 +0000 (22:57 -0500)]
tcp: fix cwnd-limited bug for TSO deferral where we send nothing
When cwnd is not a multiple of the TSO skb size of N*MSS, we can get
into persistent scenarios where we have the following sequence:
(1) ACK for full-sized skb of N*MSS arrives
-> tcp_write_xmit() transmit full-sized skb with N*MSS
-> move pacing release time forward
-> exit tcp_write_xmit() because pacing time is in the future
(2) TSQ callback or TCP internal pacing timer fires
-> try to transmit next skb, but TSO deferral finds remainder of
available cwnd is not big enough to trigger an immediate send
now, so we defer sending until the next ACK.
(3) repeat...
So we can get into a case where we never mark ourselves as
cwnd-limited for many seconds at a time, even with
bulk/infinite-backlog senders, because:
o In case (1) above, every time in tcp_write_xmit() we have enough
cwnd to send a full-sized skb, we are not fully using the cwnd
(because cwnd is not a multiple of the TSO skb size). So every time we
send data, we are not cwnd limited, and so in the cwnd-limited
tracking code in tcp_cwnd_validate() we mark ourselves as not
cwnd-limited.
o In case (2) above, every time in tcp_write_xmit() that we try to
transmit the "remainder" of the cwnd but defer, we set the local
variable is_cwnd_limited to true, but we do not send any packets, so
sent_pkts is zero, so we don't call the cwnd-limited logic to update
tp->is_cwnd_limited.
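The sequence above can be reproduced with a small Python toy model. It is a sketch under stated assumptions, not the kernel logic: write_xmit() is a heavily simplified stand-in for tcp_write_xmit(), the dictionary tp is a hypothetical substitute for the socket state, units are packets, and TSO deferral is modeled as "defer whenever less than one full skb fits in the remaining cwnd".

```python
def write_xmit(tp, tso_segs):
    """One simplified pass of tcp_write_xmit(); returns the packets sent."""
    sent_pkts = 0
    is_cwnd_limited = False            # local variable, not tp state
    pacing_blocked = False
    while not pacing_blocked:
        room = tp["cwnd"] - tp["in_flight"]
        if room < tso_segs:
            # Assumed deferral rule: less than one full skb fits, so wait
            # for the next ACK instead of sending the remainder now.
            is_cwnd_limited = True
            break
        tp["in_flight"] += tso_segs
        sent_pkts += tso_segs
        pacing_blocked = True          # pacing release time is now in the future
    if sent_pkts:
        # Only reached when something was sent: update the tracked state.
        tp["is_cwnd_limited"] = is_cwnd_limited or tp["in_flight"] >= tp["cwnd"]
    return sent_pkts


# cwnd of 25 packets is not a multiple of the 10-packet TSO skb size.
tp = {"cwnd": 25, "in_flight": 20, "is_cwnd_limited": None}
history = []
for _ in range(3):
    tp["in_flight"] -= 10                    # (1) ACK for a full-sized skb
    sent_on_ack = write_xmit(tp, 10)
    history.append((sent_on_ack, tp["is_cwnd_limited"]))
    sent_on_timer = write_xmit(tp, 10)       # (2) TSQ/pacing timer fires
    history.append((sent_on_timer, tp["is_cwnd_limited"]))
```

In this model every ACK-driven pass sends 10 packets and records "not cwnd-limited", and every timer-driven pass sends nothing and leaves the recorded state untouched, so the flag never becomes true even though the sender is in fact held back by cwnd.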
Fixes: ca8a22634381 ("tcp: make cwnd-limited checks measurement-based, and gentler")
Reported-by: Ingemar Johansson <ingemar.s.johansson@ericsson.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20201209035759.1225145-1-ncardwell.kernel@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Chris Mi [Tue, 8 Dec 2020 02:48:35 +0000 (10:48 +0800)]
net: flow_offload: Fix memory leak for indirect flow block
The offending commit introduces a cleanup callback that is invoked
when the driver module is removed to clean up the tunnel device
flow block. However, it returns on the first iteration of the for loop,
so the remaining indirect flow blocks are never freed.
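A minimal Python sketch of the early-return pattern described above. The list of dictionaries and the "owner" key are hypothetical stand-ins for the kernel's list of indirect flow blocks and its match condition; only the control flow is meant to be illustrative.

```python
def cleanup_buggy(blocks, owner):
    """Stops after the first match, so later matching blocks are leaked."""
    for block in list(blocks):
        if block["owner"] == owner:
            blocks.remove(block)
            return                 # early return: the rest is never visited


def cleanup_fixed(blocks, owner):
    """Visits every entry and releases all blocks that match."""
    for block in list(blocks):
        if block["owner"] == owner:
            blocks.remove(block)


def make_blocks():
    return [{"owner": "drv", "id": i} for i in range(3)] + [{"owner": "other", "id": 3}]


leaky = make_blocks()
cleanup_buggy(leaky, "drv")
leaked = [b for b in leaky if b["owner"] == "drv"]

clean = make_blocks()
cleanup_fixed(clean, "drv")
remaining = [b for b in clean if b["owner"] == "drv"]
```

With three blocks owned by the driver being removed, the early-return variant leaves two behind while the full loop leaves none.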
Fixes: 1fac52da5942 ("net: flow_offload: consolidate indirect flow_block infrastructure")
CC: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Roi Dayan <roid@nvidia.com>
Wei Wang [Tue, 8 Dec 2020 17:55:08 +0000 (09:55 -0800)]
tcp: Retain ECT bits for tos reflection
For DCTCP, we have to retain the ECT bits set by the congestion control
algorithm on the socket when reflecting the SYN TOS in the SYN-ACK, in
order to make ECN work properly.
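The bit manipulation this implies can be sketched in Python as follows. This is an illustration under assumptions, not the kernel code: the function name synack_tos is hypothetical, and the model assumes the ECN field is the low two bits of the TOS byte (as defined by RFC 3168), with the remaining bits taken from the SYN and the ECN bits taken from the socket.

```python
INET_ECN_MASK = 0x03   # low two bits of the TOS byte carry the ECN field
ECT_0 = 0x02           # ECN-capable transport codepoint ECT(0)


def synack_tos(syn_tos, sk_tos):
    """Reflect the non-ECN bits of the SYN, keep the socket's ECN bits."""
    reflected = syn_tos & ~INET_ECN_MASK & 0xFF
    return reflected | (sk_tos & INET_ECN_MASK)


# SYN arrives with DSCP EF (0xB8); the socket has ECT(0) set.
tos_plain_syn = synack_tos(0xB8, ECT_0)
# ECN bits carried by the SYN itself are not reflected.
tos_marked_syn = synack_tos(0xBB, ECT_0)
```

In both cases the resulting TOS is 0xBA: the DSCP part comes from the SYN and the ECT(0) bits come from the socket.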
Fixes: ac8f1710c12b ("tcp: reflect tos value received in SYN to the socket")
Reported-by: Alexander Duyck <alexanderduyck@fb.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Michal Kubecek [Tue, 8 Dec 2020 22:13:51 +0000 (23:13 +0100)]
ethtool: fix stack overflow in ethnl_parse_bitset()
Syzbot reported a stack overflow in bitmap_from_arr32() called from
ethnl_parse_bitset() when the bitset from the netlink message is longer
than the target bitmap length. While ethnl_compact_sanity_checks() makes
sure that the trailing part is all zeros (i.e. the request does not try to
touch bits the kernel does not recognize), we also need to cap change_bits
to nbits so that we don't try to write past the prepared bitmaps.
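The capping step can be illustrated with a small Python model. It is a sketch under assumptions: parse_bitset() and its arguments are hypothetical simplifications (a single value bitmap, no mask), and since Python lists cannot overflow, the example only shows that the number of bits copied is limited to the target length.

```python
def bit_is_set(words, i):
    """Test bit i in a list of 32-bit words (bit 0 is the LSB of words[0])."""
    return bool((words[i // 32] >> (i % 32)) & 1)


def parse_bitset(target, attr_words, attr_nbits):
    """Copy a compact bitset from a message into target, a list of booleans."""
    nbits = len(target)
    # Sanity check: bits beyond the target length must all be zero.
    for i in range(nbits, attr_nbits):
        if bit_is_set(attr_words, i):
            raise ValueError("request touches unknown bits")
    # The cap: never copy more bits than the target bitmap holds.
    change_bits = min(attr_nbits, nbits)
    target[:change_bits] = [bit_is_set(attr_words, i) for i in range(change_bits)]
    return change_bits


target = [False] * 40                  # target bitmap knows 40 bits
message = [0b10, 0b10, 0]              # 96-bit bitset with bits 1 and 33 set
copied = parse_bitset(target, message, 96)

try:
    parse_bitset([False] * 40, [0, 0, 1], 96)   # bit 64 is outside the target
    rejected = False
except ValueError:
    rejected = True
```

The 96-bit message is accepted because its trailing bits are zero, but only 40 bits are copied into the target.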
Björn Töpel [Tue, 25 Aug 2020 17:27:36 +0000 (19:27 +0200)]
ice: avoid premature Rx buffer reuse
The page recycle code incorrectly assumed that a page fragment
could not be freed inside xdp_do_redirect(). Because of this assumption,
page fragments that are still used by the stack/XDP redirect can be
reused and overwritten.
To avoid this, store the page count prior to invoking xdp_do_redirect().
Fixes: efc2214b6047 ("ice: Add support for XDP")
Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Björn Töpel [Tue, 25 Aug 2020 17:27:35 +0000 (19:27 +0200)]
ixgbe: avoid premature Rx buffer reuse
The page recycle code incorrectly assumed that a page fragment
could not be freed inside xdp_do_redirect(). Because of this assumption,
page fragments that are still used by the stack/XDP redirect can be
reused and overwritten.
To avoid this, store the page count prior to invoking xdp_do_redirect().
Fixes: 6453073987ba ("ixgbe: add initial support for xdp redirect")
Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Björn Töpel [Tue, 25 Aug 2020 17:27:34 +0000 (19:27 +0200)]
i40e: avoid premature Rx buffer reuse
The page recycle code incorrectly assumed that a page fragment
could not be freed inside xdp_do_redirect(). Because of this assumption,
page fragments that are still used by the stack/XDP redirect can be
reused and overwritten.
To avoid this, store the page count prior to invoking xdp_do_redirect().
Longer explanation:
Intel NICs have a recycle mechanism. The main idea is that a page is
split into two parts. One part is owned by the driver, one part might
be owned by someone else, such as the stack.
t0: Page is allocated, and put on the Rx ring
              +---------------
used by NIC ->| upper buffer
(rx_buffer)   +---------------
              | lower buffer
              +---------------
  page count == USHRT_MAX
  rx_buffer->pagecnt_bias == USHRT_MAX
t1: Buffer is received, and passed to the stack (e.g.)
              +---------------
              | upper buff (skb)
              +---------------
used by NIC ->| lower buffer
(rx_buffer)   +---------------
  page count == USHRT_MAX
  rx_buffer->pagecnt_bias == USHRT_MAX - 1
t2: Buffer is received, and redirected
              +---------------
              | upper buff (skb)
              +---------------
used by NIC ->| lower buffer
(rx_buffer)   +---------------
This means that the buffer *cannot* be flipped/reused, because the skb is
still using it.
The problem arises when xdp_do_redirect() actually frees the
segment. Then we get:
page count == USHRT_MAX - 1
rx_buffer->pagecnt_bias == USHRT_MAX - 2
From a recycle perspective, the buffer can be flipped and reused,
which means that the skb data area is passed to the Rx HW ring!
To work around this, the page count is stored prior to calling
xdp_do_redirect().
Note that this is not optimal, since the NIC could actually reuse the
"lower buffer" again. However, then we need to track whether
XDP_REDIRECT consumed the buffer or not.
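The arithmetic behind the problem can be checked with a short Python sketch. The reuse rule in can_reuse() is an assumed simplification (the page is considered reusable only when at most one reference is held outside the driver's bias); the function name and the rule are illustrative and not taken from the driver source. The counter values are the ones given in the explanation above.

```python
USHRT_MAX = 0xFFFF


def can_reuse(page_count, pagecnt_bias):
    """Assumed rule: reusable if at most one reference is not the driver's."""
    return page_count - pagecnt_bias <= 1


# State at t2, just before xdp_do_redirect(): the skb from t1 still holds
# the upper buffer and the redirected frame holds the lower one.
bias = USHRT_MAX - 2
count_before_redirect = USHRT_MAX

# xdp_do_redirect() frees the segment, dropping one page reference.
count_after_redirect = USHRT_MAX - 1

# Reading the page count after the redirect wrongly allows reuse.
reuse_with_late_count = can_reuse(count_after_redirect, bias)
# Using the count stored before the redirect keeps the page off the ring.
reuse_with_stored_count = can_reuse(count_before_redirect, bias)
```

With the late read the difference is 1 and the page looks recyclable even though the skb still uses it; with the stored count the difference is 2 and reuse is refused.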
Fixes: d9314c474d4f ("i40e: add support for XDP_REDIRECT")
Reported-and-analyzed-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Tested-by: George Kuruvinakunnel <george.kuruvinakunnel@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
Sven Auhagen [Wed, 11 Nov 2020 17:04:53 +0000 (18:04 +0100)]
igb: avoid transmit queue timeout in xdp path
Since we share the transmit queue with the network stack,
it is possible that we run into a transmit queue timeout.
This will reset the queue.
This happens under high load when XDP is using the
transmit queue pretty much exclusively.
netdev_start_xmit() sets the trans_start variable of the
transmit queue to jiffies, which is later used by dev_watchdog().
To avoid the timeout, let the stack know that an XDP xmit happened by
setting trans_start to jiffies within the XDP Tx routines.
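The effect of updating trans_start can be shown with a toy Python simulation. It is a sketch under assumptions: TxQueue, watchdog_timed_out() and run() are hypothetical names, time is counted in abstract ticks instead of jiffies, and the watchdog is reduced to a pure age check on trans_start (other conditions the real watchdog may evaluate are left out).

```python
class TxQueue:
    def __init__(self, now):
        self.trans_start = now     # time of the last recorded transmit


def watchdog_timed_out(txq, now, timeout):
    """Simplified watchdog: the last recorded transmit is too old."""
    return now - txq.trans_start > timeout


def run(bump_on_xdp_xmit, ticks=20, timeout=5):
    """XDP transmits on every tick; the stack itself never transmits."""
    txq = TxQueue(now=0)
    for now in range(1, ticks + 1):
        if bump_on_xdp_xmit:
            txq.trans_start = now  # record the XDP transmit for the watchdog
        if watchdog_timed_out(txq, now, timeout):
            return True            # the queue would be reset here
    return False


timeout_without_bump = run(bump_on_xdp_xmit=False)
timeout_with_bump = run(bump_on_xdp_xmit=True)
```

Without the update the watchdog sees a stale trans_start and fires even though the queue is busy; with the update it never fires in this model.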
Fixes: 9cbc948b5a20 ("igb: add XDP support")
Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>
Signed-off-by: Sven Auhagen <sven.auhagen@voleatech.de>
Tested-by: Sandeep Penigalapati <sandeep.penigalapati@intel.com>
Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>