Amir Goldstein [Wed, 6 Jan 2021 07:52:36 +0000 (09:52 +0200)]
nfsd: report per-export stats
Collect some nfsd stats per export in addition to the global stats.
A new nfsdfs export_stats file is created. It uses the same ops as the
exports file to iterate the export entries and we use the file's name to
determine the reported info per export. For example:
Amir Goldstein [Wed, 6 Jan 2021 07:52:34 +0000 (09:52 +0200)]
nfsd: remove unused stats counters
Commit 501cb1849f86 ("nfsd: rip out the raparms cache") removed the
code that updates the read-ahead cache stats counters,
commit 8bbfa9f3889b ("knfsd: remove the nfsd thread busy histogram")
removed the code that updates the thread busy stats counters back in 2009,
and the code that updated the filehandle cache stats was removed back in 2002.
Remove the unused stats counters from nfsd_stats struct and print
hardcoded zeros in /proc/net/rpc/nfsd.
Signed-off-by: Amir Goldstein <amir73il@gmail.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Tue, 20 Oct 2020 14:08:19 +0000 (10:08 -0400)]
NFSD: Remove argument length checking in nfsd_dispatch()
Now that the argument decoders for NFSv2 and NFSv3 use the
xdr_stream mechanism, the version-specific length checking logic in
nfsd_dispatch() is no longer necessary.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Wed, 21 Oct 2020 16:21:25 +0000 (12:21 -0400)]
NFSD: Update the NFSv2 READLINK argument decoder to use struct xdr_stream
If the code that sets up the sink buffer for nfsd_readlink() is
moved adjacent to the nfsd_readlink() call site that uses it, then
the only argument is a file handle, and the fhandle decoder can be
used instead.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Mon, 19 Oct 2020 17:23:52 +0000 (13:23 -0400)]
NFSD: Update READDIR3args decoders to use struct xdr_stream
As an additional clean up, remove the dircount argument from struct
nfsd3_readdirargs: neither nfsd3_proc_readdir() nor
nfsd3_proc_readdirplus() makes use of it.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Tue, 17 Nov 2020 14:50:23 +0000 (09:50 -0500)]
NFSD: Add helper to set up the pages where the dirlist is encoded
De-duplicate some code that is used by both READDIR and READDIRPLUS
to build the dirlist in the Reply. Because this code is not related
to decoding READ arguments, it is moved to a more appropriate spot.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Tue, 10 Nov 2020 15:24:39 +0000 (10:24 -0500)]
NFSD: Fix returned READDIR offset cookie
Code inspection shows that the server's NFSv3 READDIR implementation
handles offset cookies slightly differently than the NFSv2 READDIR,
NFSv3 READDIRPLUS, and NFSv4 READDIR implementations,
and there doesn't seem to be any need for this difference.
As a clean up, I copied the logic from nfsd3_proc_readdirplus().
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Thu, 22 Oct 2020 15:14:55 +0000 (11:14 -0400)]
NFSD: Update WRITE3arg decoder to use struct xdr_stream
As part of the update, the open-coded check that compares the size of
the data payload against the length of the RPC Call message has to be
re-implemented using the xdr_stream infrastructure.
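A minimal user-space sketch of this kind of length check follows. It is not the kernel's xdr_stream code: xdr_pad4() and write_payload_fits() are hypothetical names. The sketch assumes the decoder knows how many bytes remain in the Call message. It compares that count against the client-supplied payload length, rounded up to XDR's 4-byte alignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* XDR pads opaque data to a multiple of 4 bytes. */
static size_t xdr_pad4(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/*
 * Hypothetical helper: does a WRITE payload whose length the client
 * claims is 'count' fit in the 'remaining' bytes left in the RPC Call
 * message after the fixed arguments have been decoded?
 */
static bool write_payload_fits(uint32_t count, size_t remaining)
{
	/* Refuse lengths that would wrap around when rounded up. */
	if ((size_t)count > SIZE_MAX - 3)
		return false;
	return xdr_pad4(count) <= remaining;
}
```

A payload of 5 bytes occupies 8 bytes on the wire, so it fits in 8 remaining bytes, but a claimed 9-byte payload does not.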
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Chuck Lever [Thu, 17 Sep 2020 21:22:49 +0000 (17:22 -0400)]
SUNRPC: Make trace_svc_process() display the RPC procedure symbolically
The next few patches will employ these strings to help make
server-side trace logs more human-readable. A similar technique is already
in use in kernel RPC client code.
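The sketch below illustrates the idea only; the kernel's trace macros have their own way to print symbolic values. The lookup table and nfs3_proc_name() are hypothetical. They map NFSv3 procedure numbers, as listed in RFC 1813, to names that a trace line could print instead of raw integers.

```c
#include <assert.h>
#include <string.h>

/* NFSv3 procedure names indexed by procedure number (RFC 1813). */
static const char *const nfs3_proc_names[] = {
	"NULL", "GETATTR", "SETATTR", "LOOKUP", "ACCESS", "READLINK",
	"READ", "WRITE", "CREATE", "MKDIR", "SYMLINK", "MKNOD",
	"REMOVE", "RMDIR", "RENAME", "LINK", "READDIR", "READDIRPLUS",
	"FSSTAT", "FSINFO", "PATHCONF", "COMMIT",
};

/* Map a procedure number to a printable name for a trace line. */
static const char *nfs3_proc_name(unsigned int proc)
{
	if (proc >= sizeof(nfs3_proc_names) / sizeof(nfs3_proc_names[0]))
		return "UNKNOWN";
	return nfs3_proc_names[proc];
}
```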
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Linus Torvalds [Sun, 24 Jan 2021 20:30:14 +0000 (12:30 -0800)]
Merge tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Pull io_uring fixes from Jens Axboe:
"Still need a final cancelation fix that isn't quite done done,
expected in the next day or two. That said, this contains:
- Wakeup fix for IOPOLL requests
- SQPOLL split close op handling fix
- Ensure that any use of io_uring fd itself is marked as inflight
- Short non-regular file read fix (Pavel)
- Fix up bad false positive warning (Pavel)
- SQPOLL fixes (Pavel)
- In-flight removal fix (Pavel)"
* tag 'io_uring-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
io_uring: account io_uring internal files as REQ_F_INFLIGHT
io_uring: fix sleeping under spin in __io_clean_op
io_uring: fix short read retries for non-reg files
io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
io_uring: fix skipping disabling sqo on exec
io_uring: fix uring_flush in exit_files() warning
io_uring: fix false positive sqo warning on flush
io_uring: iopoll requests should also wake task ->in_idle state
Linus Torvalds [Sun, 24 Jan 2021 20:24:35 +0000 (12:24 -0800)]
Merge tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block
Pull block fixes from Jens Axboe:
- NVMe pull request from Christoph:
- fix a status code in nvmet (Chaitanya Kulkarni)
- avoid double completions in nvme-rdma/nvme-tcp (Chao Leng)
- fix the CMB support to cope with NVMe 1.4 controllers (Klaus Jensen)
- fix PRINFO handling in the passthrough ioctl (Revanth Rajashekar)
- fix a double DMA unmap in nvme-pci
* tag 'block-5.11-2021-01-24' of git://git.kernel.dk/linux-block:
lightnvm: fix memory leak when submit fails
nvme-pci: fix error unwind in nvme_map_data
nvme-pci: refactor nvme_unmap_data
md: Set prev_flush_start and flush_bio in an atomic way
nvmet: set right status on error in id-ns handler
nvme-pci: allow use of cmb on v1.4 controllers
nvme-tcp: avoid request double completion for concurrent nvme_tcp_timeout
nvme-rdma: avoid request double completion for concurrent nvme_rdma_timeout
nvme: check the PRINFO bit before deciding the host buffer length
Linus Torvalds [Sun, 24 Jan 2021 20:16:34 +0000 (12:16 -0800)]
Merge branch 'akpm' (patches from Andrew)
Merge misc fixes from Andrew Morton:
"18 patches.
Subsystems affected by this patch series: mm (pagealloc, memcg, kasan,
memory-failure, and highmem), ubsan, proc, and MAINTAINERS"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
MAINTAINERS: add a couple more files to the Clang/LLVM section
proc_sysctl: fix oops caused by incorrect command parameters
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
mips/mm/highmem: use set_pte() for kmap_local()
mm/highmem: prepare for overriding set_pte_at()
sparc/mm/highmem: flush cache and TLB
mm: fix page reference leak in soft_offline_page()
ubsan: disable unsigned-overflow check for i386
kasan, mm: fix resetting page_alloc tags for HW_TAGS
kasan, mm: fix conflicts with init_on_alloc/free
kasan: fix HW_TAGS boot parameters
kasan: fix incorrect arguments passing in kasan_add_zero_shadow
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
mm: fix numa stats for thp migration
mm: memcg: fix memcg file_dirty numa stat
mm: memcg/slab: optimize objcg stock draining
mm: fix initialization of struct page for holes in memory layout
x86/setup: don't remove E820_TYPE_RAM for pfn 0
Linus Torvalds [Sun, 24 Jan 2021 19:26:46 +0000 (11:26 -0800)]
Merge tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
Pull char/misc driver fixes from Greg KH:
"Here are some small char/misc driver fixes for 5.11-rc5:
- habanalabs driver fixes
- phy driver fixes
- hwtracing driver fixes
- rtsx cardreader driver fix
All of these have been in linux-next with no reported issues"
* tag 'char-misc-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc:
misc: rtsx: init value of aspm_enabled
habanalabs: disable FW events on device removal
habanalabs: fix backward compatibility of idle check
habanalabs: zero pci counters packet before submit to FW
intel_th: pci: Add Alder Lake-P support
stm class: Fix module init return on allocation failure
habanalabs: prevent soft lockup during unmap
habanalabs: fix reset process in case of failures
habanalabs: fix dma_addr passed to dma_mmap_coherent
phy: mediatek: allow compile-testing the dsi phy
phy: cpcap-usb: Fix warning for missing regulator_disable
PHY: Ingenic: fix unconditional build of phy-ingenic-usb
Linus Torvalds [Sun, 24 Jan 2021 19:02:01 +0000 (11:02 -0800)]
Merge tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging/IIO driver fixes from Greg KH:
"Here are some IIO driver fixes for 5.11-rc5 to resolve some reported
problems.
Nothing major, just a few small fixes, all of these have been in
linux-next for a while and full details are in the shortlog"
* tag 'staging-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
iio: sx9310: Fix semtech,avg-pos-strength setting when > 16
iio: common: st_sensors: fix possible infinite loop in st_sensors_irq_thread
iio: ad5504: Fix setting power-down state
counter:ti-eqep: remove floor
drivers: iio: temperature: Add delay after the addressed reset command in mlx90632.c
iio: adc: ti_am335x_adc: remove omitted iio_kfifo_free()
dt-bindings: iio: accel: bma255: Fix bmc150/bmi055 compatible
iio: sx9310: Off by one in sx9310_read_thresh()
Linus Torvalds [Sun, 24 Jan 2021 18:56:45 +0000 (10:56 -0800)]
Merge tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty
Pull tty/serial fixes from Greg KH:
"Here are three small tty/serial fixes for 5.11-rc5 to resolve reported
problems:
- two patches to fix up writing to ttys with splice
- mvebu-uart driver fix for reported problem
All of these have been in linux-next with no reported problems"
* tag 'tty-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty:
tty: fix up hung_up_tty_write() conversion
tty: implement write_iter
serial: mvebu-uart: fix tx lost characters at power off
Linus Torvalds [Sun, 24 Jan 2021 18:54:54 +0000 (10:54 -0800)]
Merge tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb
Pull USB fixes from Greg KH:
"Here are some small USB driver fixes for 5.11-rc5. They resolve:
- xhci issues for some reported problems
- ehci driver issue for one specific device
- USB gadget fixes for some reported problems
- cdns3 driver fixes for issues reported
- MAINTAINERS file update
- thunderbolt minor fix
All of these have been in linux-next with no reported issues"
* tag 'usb-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb:
usb: bdc: Make bdc pci driver depend on BROKEN
xhci: tegra: Delay for disabling LFPS detector
xhci: make sure TRB is fully written before giving it to the controller
usb: udc: core: Use lock when write to soft_connect
USB: gadget: dummy-hcd: Fix errors in port-reset handling
usb: gadget: aspeed: fix stop dma register setting.
USB: ehci: fix an interrupt calltrace error
ehci: fix EHCI host controller initialization sequence
MAINTAINERS: update Peter Chen's email address
thunderbolt: Drop duplicated 0x prefix from format string
MAINTAINERS: Update address for Cadence USB3 driver
usb: cdns3: imx: improve driver .remove API
usb: cdns3: imx: fix can't create core device the second time issue
usb: cdns3: imx: fix writing read-only memory issue
Xiaoming Ni [Sun, 24 Jan 2021 05:02:16 +0000 (21:02 -0800)]
proc_sysctl: fix oops caused by incorrect command parameters
process_sysctl_arg() does not check whether val is empty before
invoking strlen(val). If a command line parameter is incorrectly
configured and val is empty, an oops is triggered.
For example:
if "hung_task_panic=1" is incorrectly written as "hung_task_panic", an oops is
triggered. The call stack is as follows:
Kernel command line: .... hung_task_panic
......
Call trace:
__pi_strlen+0x10/0x98
parse_args+0x278/0x344
do_sysctl_args+0x8c/0xfc
kernel_init+0x5c/0xf4
ret_from_fork+0x10/0x30
To fix it, check whether "val" is empty when "param" is a sysctl field.
Error codes are returned in the failure branch, and error logs are
generated by parse_args().
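The following is a minimal sketch of the guard. It is written as a hypothetical user-space handler rather than the actual process_sysctl_arg() code. It assumes, as in the trace above, that a parameter given without "=value" arrives with val == NULL. It rejects both that case and an empty value before strlen() runs.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical stand-in for a sysctl command-line handler. */
static int handle_sysctl_arg(const char *param, const char *val)
{
	size_t len;

	if (param == NULL)
		return -EINVAL;
	if (val == NULL)
		return -EINVAL;	/* "hung_task_panic" written without "=1" */
	len = strlen(val);	/* safe now: val is known to be non-NULL */
	if (len == 0)
		return -EINVAL;	/* "hung_task_panic=" with an empty value */
	/* A real handler would now write val to the sysctl. */
	return 0;
}
```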
Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com
Fixes: 3db978d480e2843 ("kernel/sysctl: support setting sysctl parameters from kernel command line")
Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: <stable@vger.kernel.org> [5.8+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 24 Jan 2021 05:02:11 +0000 (21:02 -0800)]
powerpc/mm/highmem: use __set_pte_at() for kmap_local()
The original PowerPC highmem mapping function used __set_pte_at() to
denote that the mapping is per CPU. This got lost with the conversion
to the generic implementation.
Override the default map function.
Link: https://lkml.kernel.org/r/20210112170411.281464308@linutronix.de
Fixes: 47da42b27a56 ("powerpc/mm/highmem: Switch to generic kmap atomic")
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thomas Gleixner [Sun, 24 Jan 2021 05:02:02 +0000 (21:02 -0800)]
mm/highmem: prepare for overriding set_pte_at()
The generic kmap_local() map function uses set_pte_at(), but MIPS requires
set_pte() and PowerPC wants __set_pte_at().
Provide arch_kmap_local_set_pte() and default it to set_pte_at().
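The sketch below assumes the usual kernel idiom for an overridable default: an #ifndef-guarded macro that an architecture header can predefine. The page-table setters are stubs that only record which one ran. The sketch therefore shows the override mechanism, not the real mm/highmem.c code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Records which setter ran, so the example can be checked. */
static const char *last_setter = "none";

/* Stub page-table setters standing in for the real ones. */
static inline void set_pte_at(void *mm, unsigned long vaddr,
			      unsigned long *ptep, unsigned long pte)
{
	(void)mm;
	(void)vaddr;
	*ptep = pte;
	last_setter = "set_pte_at";
}

static inline void set_pte(unsigned long *ptep, unsigned long pte)
{
	*ptep = pte;
	last_setter = "set_pte";
}

/* "Architecture" header: this arch wants set_pte(), as MIPS does. */
#define arch_kmap_local_set_pte(mm, vaddr, ptep, ptev) set_pte(ptep, ptev)

/* Generic code: supply the default only if no override exists. */
#ifndef arch_kmap_local_set_pte
#define arch_kmap_local_set_pte(mm, vaddr, ptep, ptev) \
	set_pte_at(mm, vaddr, ptep, ptev)
#endif

/* Generic map step that always goes through the hook. */
static void kmap_local_map(unsigned long *ptep, unsigned long pte)
{
	arch_kmap_local_set_pte(NULL, 0UL, ptep, pte);
}
```

If you remove the architecture's #define, kmap_local_map() falls back to set_pte_at().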
Link: https://lkml.kernel.org/r/20210112170411.056306194@linutronix.de
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Cercueil <paul@crapouillou.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Dan Williams [Sun, 24 Jan 2021 05:01:52 +0000 (21:01 -0800)]
mm: fix page reference leak in soft_offline_page()
The conversion to move pfn_to_online_page() internal to
soft_offline_page() missed that the get_user_pages() reference taken by
the madvise() path needs to be dropped when pfn_to_online_page() fails.
Note the direct sysfs-path to soft_offline_page() does not perform a
get_user_pages() lookup.
When soft_offline_page() is handed a pfn_valid() && !pfn_to_online_page()
pfn, the kernel hangs at dax-device shutdown due to a leaked reference.
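Below is a simplified model of the leak and its fix. It assumes a plain integer refcount in place of struct page and get_user_pages(). The names fake_page and soft_offline_sketch(), and the -EIO return value, are hypothetical choices for the sketch. The point is that the early-exit path for a page that is not online must drop the caller's reference, just as the success path does.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: a refcount and an online bit. */
struct fake_page {
	int refcount;
	bool online;
};

static inline void get_page_ref(struct fake_page *p) { p->refcount++; }
static inline void put_page_ref(struct fake_page *p) { p->refcount--; }

/*
 * 'caller_holds_ref' models the madvise() path, which already took a
 * reference via get_user_pages(); the sysfs path does not.
 */
static int soft_offline_sketch(struct fake_page *p, bool caller_holds_ref)
{
	if (!p->online) {
		/* The fix: drop the caller's reference on this early exit too. */
		if (caller_holds_ref)
			put_page_ref(p);
		return -EIO;
	}
	/* The real code would migrate the page contents here. */
	if (caller_holds_ref)
		put_page_ref(p);
	return 0;
}
```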
Link: https://lkml.kernel.org/r/161058501210.1840162.8108917599181157327.stgit@dwillia2-desk3.amr.corp.intel.com
Fixes: feec24a6139d ("mm, soft-offline: convert parameter to pfn")
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Qian Cai <cai@lca.pw>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Sun, 24 Jan 2021 05:01:43 +0000 (21:01 -0800)]
kasan, mm: fix resetting page_alloc tags for HW_TAGS
A previous commit added resetting KASAN page tags to
kernel_init_free_pages() to avoid false-positives due to accesses to
metadata with the hardware tag-based mode.
That commit did reset page tags before the metadata access, but didn't
restore them after. As a result, KASAN fails to detect bad accesses
to page_alloc allocations on some configurations.
Fix this by recovering the tag after the metadata access.
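The save, reset, restore pattern can be sketched as follows. The sketch assumes the page tag is a per-page field. The names fake_page, TAG_RESET and clear_pages_keep_tags() are hypothetical, and the reset value is arbitrary. The restore line is the step the earlier change was missing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch assumption: a fixed value stands in for a "reset" tag. */
#define TAG_RESET 0xffu

/* Hypothetical page: a tag plus some contents to clear. */
struct fake_page {
	uint8_t kasan_tag;
	unsigned char data[16];
};

/*
 * Clear each page while keeping its tag: save the tag, reset it for the
 * metadata access, then restore it so later bad accesses are still caught.
 */
static void clear_pages_keep_tags(struct fake_page *pages, int n)
{
	for (int i = 0; i < n; i++) {
		uint8_t tag = pages[i].kasan_tag;	/* save */

		pages[i].kasan_tag = TAG_RESET;		/* reset for the access */
		memset(pages[i].data, 0, sizeof(pages[i].data));
		pages[i].kasan_tag = tag;		/* restore: the missing step */
	}
}
```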
Link: https://lkml.kernel.org/r/02b5bcd692e912c27d484030f666b350ad7e4ae4.1611074450.git.andreyknvl@google.com
Fixes: aa1ef4d7b3f6 ("kasan, mm: reset tags when accessing metadata")
Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Marco Elver <elver@google.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Evgenii Stepanov <eugenis@google.com>
Cc: Branislav Rankov <Branislav.Rankov@arm.com>
Cc: Kevin Brodsky <kevin.brodsky@arm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Andrey Konovalov [Sun, 24 Jan 2021 05:01:38 +0000 (21:01 -0800)]
kasan, mm: fix conflicts with init_on_alloc/free
A few places where SLUB accesses an object's data or metadata were missed
in a previous patch. This leads to false positives with hardware
tag-based KASAN when bulk allocations are used with init_on_alloc/free.
Fix the false-positives by resetting pointer tags during these accesses.
(The kasan_reset_tag call is removed from slab_alloc_node, as it's added
into maybe_wipe_obj_freeptr.)
Andrey Konovalov [Sun, 24 Jan 2021 05:01:34 +0000 (21:01 -0800)]
kasan: fix HW_TAGS boot parameters
The initially proposed KASAN command line parameters are redundant.
This change drops the complex "kasan.mode=off/prod/full" parameter and
adds a simpler kill switch "kasan=off/on" instead. The new parameter
together with the already existing ones provides a cleaner way to
express the same set of features.
The full set of parameters with this change:
kasan=off/on - whether KASAN is enabled
kasan.fault=report/panic - whether to only print a report or also panic
kasan.stacktrace=off/on - whether to collect alloc/free stack traces
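The following is a hypothetical user-space parser for the three parameters listed above. It assumes each parameter arrives as a key/value pair. It sketches the accepted values only, not the kernel's actual boot-parameter handling; the dropped kasan.mode key is simply rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical settings the three parameters control. */
struct kasan_opts {
	bool enabled;        /* kasan=off/on */
	bool panic_on_fault; /* kasan.fault=report/panic */
	bool stacktrace;     /* kasan.stacktrace=off/on */
};

/* Parse one key/value pair; returns 0 on success, -1 if not accepted. */
static int kasan_parse_opt(struct kasan_opts *o, const char *key,
			   const char *val)
{
	bool on, off;

	if (key == NULL || val == NULL)
		return -1;
	on = strcmp(val, "on") == 0;
	off = strcmp(val, "off") == 0;

	if (strcmp(key, "kasan") == 0 && (on || off)) {
		o->enabled = on;
		return 0;
	}
	if (strcmp(key, "kasan.stacktrace") == 0 && (on || off)) {
		o->stacktrace = on;
		return 0;
	}
	if (strcmp(key, "kasan.fault") == 0) {
		if (strcmp(val, "report") == 0) {
			o->panic_on_fault = false;
			return 0;
		}
		if (strcmp(val, "panic") == 0) {
			o->panic_on_fault = true;
			return 0;
		}
	}
	return -1;	/* e.g. the dropped "kasan.mode" key */
}
```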
Lecopzer Chen [Sun, 24 Jan 2021 05:01:25 +0000 (21:01 -0800)]
kasan: fix unaligned address is unhandled in kasan_remove_zero_shadow
During testing of kasan_populate_early_shadow() and kasan_remove_zero_shadow(),
if the shadow start and end addresses passed to kasan_remove_zero_shadow() are
not aligned to PMD_SIZE, the remaining unaligned PTEs won't be removed.
For example, 0xffffffbf80000000 ~ 0xffffffbfbdf80000 will not be removed because in
kasan_remove_pud_table(), kasan_pmd_table(*pud) is true but the next
address is 0xffffffbfbdf80000, which is not aligned to PUD_SIZE.
In that case the walk should fall back to the next level,
kasan_remove_pmd_table(), but the control flow always continues and skips
the unaligned part.
Fix this by correcting the condition for the case where addr and next are not both aligned.
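The corrected decision can be isolated in a small sketch. It assumes 4 KiB pages, so that a PUD entry maps 1 GiB, and it reuses the addresses quoted above. The function pud_action_for() is hypothetical and only models the choice made in kasan_remove_pud_table().

```c
#include <assert.h>
#include <stdint.h>

#define IS_ALIGNED(x, a) (((x) & ((uint64_t)(a) - 1)) == 0)

/* Sketch assumption: 4 KiB pages, so a PUD entry maps 1 GiB. */
#define PUD_SIZE (UINT64_C(1) << 30)

enum pud_action { PUD_CLEAR, PUD_DESCEND };

/*
 * Decide what to do with a PUD entry covering [addr, next). Only when
 * BOTH ends are PUD-aligned may the whole entry be cleared; otherwise the
 * walk must descend to the PMD (and possibly PTE) level instead of
 * skipping to the next entry, which is what the buggy flow did.
 */
static enum pud_action pud_action_for(uint64_t addr, uint64_t next)
{
	if (IS_ALIGNED(addr, PUD_SIZE) && IS_ALIGNED(next, PUD_SIZE))
		return PUD_CLEAR;
	return PUD_DESCEND;
}
```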
Link: https://lkml.kernel.org/r/20210103135621.83129-1-lecopzer@gmail.com
Fixes: 0207df4fa1a86 ("kernel/memremap, kasan: make ZONE_DEVICE with work with KASAN")
Signed-off-by: Lecopzer Chen <lecopzer.chen@mediatek.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: YJ Chiang <yj.chiang@mediatek.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Sun, 24 Jan 2021 18:17:03 +0000 (10:17 -0800)]
Merge tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool fixes from Borislav Petkov:
- Adjust objtool to handle a recent binutils change to not generate
unused symbols anymore.
- Revert the fail-the-build-on-fatal-errors objtool strategy for now
due to the ever-increasing matrix of supported toolchains/plugins and
them causing too many such fatal errors currently.
- Do not add empty symbols to objtool's rbtree to accommodate clang
removing section symbols.
* tag 'objtool_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
objtool: Don't fail on missing symbol table
objtool: Don't fail the kernel build on fatal errors
objtool: Don't add empty symbols to the rbtree
Linus Torvalds [Sun, 24 Jan 2021 18:09:20 +0000 (10:09 -0800)]
Merge tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler fixes from Borislav Petkov:
- Correct the marking of kthreads which are supposed to run on a
specific, single CPU versus those which are merely affine to only one
CPU, mark per-cpu workqueue threads as such, and make sure that marking
"survives" CPU hotplug. Fix CPU hotplug issues with such kthreads.
- A fix to not push away tasks on CPUs coming online.
- Have workqueue CPU hotplug code use cpu_possible_mask when breaking
affinity on CPU offlining so that pending workers can finish on newly
onlined CPUs too.
- Dump tasks which haven't vacated a CPU which is currently being
unplugged.
- Register a special scale invariance callback which gets called on
resume from RAM to read out APERF/MPERF after resume and thus make
the schedutil scaling governor more precise.
* tag 'sched_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched: Relax the set_cpus_allowed_ptr() semantics
sched: Fix CPU hotplug / tighten is_per_cpu_kthread()
sched: Prepare to use balance_push in ttwu()
workqueue: Restrict affinity change to rescuer
workqueue: Tag bound workers with KTHREAD_IS_PER_CPU
kthread: Extract KTHREAD_IS_PER_CPU
sched: Don't run cpu-online with balance_push() enabled
workqueue: Use cpu_possible_mask instead of cpu_active_mask to break affinity
sched/core: Print out straggler tasks in sched_cpu_dying()
x86: PM: Register syscore_ops for scale invariance
Linus Torvalds [Sun, 24 Jan 2021 17:46:05 +0000 (09:46 -0800)]
Merge tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Add a new Intel model number for Alder Lake
- Differentiate which aspects of the FPU state get saved/restored when
the FPU is used in-kernel and fix a boot crash on K7 due to early
MXCSR access before CR4.OSFXSR is even set.
- A couple of noinstr annotation fixes
- Correct die ID setting on AMD for users of topology information which
need the correct die ID
- A SEV-ES fix to handle string port IO to/from kernel memory properly
* tag 'x86_urgent_for_v5.11_rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Add another Alder Lake CPU to the Intel family
x86/mmx: Use KFPU_387 for MMX string operations
x86/fpu: Add kernel_fpu_begin_mask() to selectively initialize state
x86/topology: Make __max_die_per_package available unconditionally
x86: __always_inline __{rd,wr}msr()
x86/mce: Remove explicit/superfluous tracing
locking/lockdep: Avoid noinstr warning for DEBUG_LOCKDEP
locking/lockdep: Cure noinstr fail
x86/sev: Fix nonistr violation
x86/entry: Fix noinstr fail
x86/cpu/amd: Set __max_die_per_package on AMD
x86/sev-es: Handle string port IO to kernel memory properly
Linus Torvalds [Sun, 24 Jan 2021 17:40:51 +0000 (09:40 -0800)]
Merge tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc fixes from Michael Ellerman:
- Fix a bad interaction between the scv handling and the fallback L1D
flush, which could lead to user register corruption. Only affects
people using scv (~no one) on machines with old firmware that are
missing the L1D flush.
- Two small selftest fixes.
Thanks to Eirik Fuller, Libor Pechacek, Nicholas Piggin, Sandipan Das,
and Tulio Magno Quites Machado Filho.
* tag 'powerpc-5.11-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
powerpc/64s: fix scv entry fallback flush vs interrupt
selftests/powerpc: Only test lwm/stmw on big endian
selftests/powerpc: Fix exit status of pkey tests
Linus Torvalds [Sun, 24 Jan 2021 17:35:28 +0000 (09:35 -0800)]
Merge tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux
Pull misc fixes from Christian Brauner:
- Jann reported sparse complaints because of a missing __user
annotation in a helper we added way back when we added
pidfd_send_signal() to avoid compat syscall handling. Fix it.
- Yanfei replaces a reference in a comment to the _do_fork() helper I
removed a while ago with a reference to the new kernel_clone()
replacement.
- Alexander Guril added a simple coding style fix.
* tag 'for-linus-2021-01-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
kthread: remove comments about old _do_fork() helper
Kernel: fork.c: Fix coding style: Do not use {} around single-line statements
signal: Add missing __user annotation to copy_siginfo_from_user_any
Linus Torvalds [Sun, 24 Jan 2021 17:27:14 +0000 (09:27 -0800)]
Merge tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6
Pull cifs fixes from Steve French:
"An important signal handling patch for stable, and two small cleanup
patches"
* tag '5.11-rc4-smb3' of git://git.samba.org/sfrench/cifs-2.6:
cifs: do not fail __smb_send_rqst if non-fatal signals are pending
fs/cifs: Simplify bool comparison.
fs/cifs: Assign boolean values to a bool variable
Shakeel Butt [Sun, 24 Jan 2021 05:01:15 +0000 (21:01 -0800)]
mm: fix numa stats for thp migration
Currently the kernel does not correctly update the numa stats for
NR_FILE_PAGES and NR_SHMEM on THP migration. Fix that.
For NR_FILE_DIRTY and NR_ZONE_WRITE_PENDING, there is no need to handle
THP migration at the moment, as the kernel still does not have write
support for file THP. To be more future-proof, however, this patch adds
THP support for those stats as well.
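Below is a sketch of the accounting rule. It assumes two plain per-node counters; node_stats and migrate_stats() are hypothetical. The key point is that a THP moves by its number of base pages rather than by one; in the kernel, that count would come from a helper such as thp_nr_pages(). With 4 KiB base pages, a 2 MiB THP spans 512 of them.

```c
#include <assert.h>

/* Hypothetical per-node counters for two of the stats named above. */
struct node_stats {
	long nr_file_pages;
	long nr_shmem;
};

/*
 * On migration, move the counts from the old node to the new node by the
 * number of base pages the (possibly huge) page spans, not by 1.
 */
static void migrate_stats(struct node_stats *from, struct node_stats *to,
			  long nr_pages, int is_shmem)
{
	from->nr_file_pages -= nr_pages;
	to->nr_file_pages += nr_pages;
	if (is_shmem) {
		from->nr_shmem -= nr_pages;
		to->nr_shmem += nr_pages;
	}
}
```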
Link: https://lkml.kernel.org/r/20210108155813.2914586-2-shakeelb@google.com
Fixes: e71769ae52609 ("mm: enable thp migration for shmem thp")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shakeel Butt [Sun, 24 Jan 2021 05:01:11 +0000 (21:01 -0800)]
mm: memcg: fix memcg file_dirty numa stat
The kernel updates the per-node NR_FILE_DIRTY stats on page migration
but not the memcg numa stats.
That was not an issue until commit 5f9a4f4a7096 ("mm:
memcontrol: add the missing numa_stat interface for cgroup v2") recently
exposed numa stats for the memcg.
So fix the file_dirty per-memcg numa stat.
Link: https://lkml.kernel.org/r/20210108155813.2914586-1-shakeelb@google.com
Fixes: 5f9a4f4a7096 ("mm: memcontrol: add the missing numa_stat interface for cgroup v2")
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Acked-by: Yang Shi <shy828301@gmail.com>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Roman Gushchin [Sun, 24 Jan 2021 05:01:07 +0000 (21:01 -0800)]
mm: memcg/slab: optimize objcg stock draining
Imran Khan reported a 16% regression in hackbench results caused by the
commit f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects
instead of pages"). The regression is noticeable in the case of
consecutive allocations of several relatively large slab objects, e.g.
skbs. As soon as the amount of stocked bytes exceeds PAGE_SIZE,
drain_obj_stock() and __memcg_kmem_uncharge() are called, and it leads
to a number of atomic operations in page_counter_uncharge().
The corresponding call graph is below (provided by Imran Khan):
Instead of directly uncharging the accounted kernel memory, it's
possible to refill the generic page-sized per-cpu stock. It's a
much faster operation, especially on a default hierarchy. As a bonus,
__memcg_kmem_uncharge_page() will also get faster, so the freeing of
page-sized kernel allocations (e.g. large kmallocs) will become faster.
A similar change has been done earlier for the socket memory by the
commit 475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket
memory uncharging").
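The following is a simplified model of the change. It assumes 4 KiB pages and ignores the memcg hierarchy; struct stock, drain_sketch() and the counter are hypothetical. Whole pages taken out of the byte-granular object stock can go one of two ways. The old path returns them to the shared page counter, which is atomic-heavy. The new path refills the per-cpu page stock.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096u	/* sketch assumption: 4 KiB pages */

/* Hypothetical per-cpu stock state. */
struct stock {
	unsigned int obj_bytes;	/* byte-granular leftovers from slab objects */
	unsigned int nr_pages;	/* page-granular per-cpu stock */
};

/* Pages handed back to the shared counter (the expensive path). */
static unsigned long counter_uncharged_pages;

/*
 * Move whole pages out of the byte stock. Old behaviour: uncharge them
 * from the shared page counter. New behaviour: refill the per-cpu page
 * stock, which is a cheap local update.
 */
static void drain_sketch(struct stock *s, int refill_page_stock)
{
	unsigned int nr_pages = s->obj_bytes / SKETCH_PAGE_SIZE;

	s->obj_bytes %= SKETCH_PAGE_SIZE;
	if (nr_pages == 0)
		return;
	if (refill_page_stock)
		s->nr_pages += nr_pages;
	else
		counter_uncharged_pages += nr_pages;
}
```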
Link: https://lkml.kernel.org/r/20210106042239.2860107-1-guro@fb.com
Fixes: f2fe7b09a52b ("mm: memcg/slab: charge individual slab objects instead of pages")
Signed-off-by: Roman Gushchin <guro@fb.com>
Reported-by: Imran Khan <imran.f.khan@oracle.com>
Tested-by: Imran Khan <imran.f.khan@oracle.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Sun, 24 Jan 2021 05:01:02 +0000 (21:01 -0800)]
mm: fix initialization of struct page for holes in memory layout
There could be struct pages that are not backed by actual physical
memory. This can happen when the actual memory bank is not a multiple
of SECTION_SIZE or when an architecture does not register memory holes
reserved by the firmware as memblock.memory.
Such pages are currently initialized using init_unavailable_mem()
function that iterates through PFNs in holes in memblock.memory and if
there is a struct page corresponding to a PFN, the fields of this page
are set to default values and the page is marked as Reserved.
init_unavailable_mem() does not take into account the zone and node the page
belongs to and sets both zone and node links in struct page to zero.
On a system that has firmware-reserved holes in a zone above ZONE_DMA,
the unset zone link in struct page can trigger a VM_BUG_ON_PAGE() zone
check, because there are pages in both ZONE_DMA32 and ZONE_DMA (unset zone
link in struct page) in the same pageblock.
Update init_unavailable_mem() to use zone constraints defined by an
architecture to properly set up the zone link and use the node ID of the
adjacent range in memblock.memory to set the node link.
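For the zone part of the fix, here is a minimal sketch. It assumes the architecture supplies an exclusive upper PFN limit per zone. The function zone_for_pfn() and the limits in the test are hypothetical x86-64-like values with 4 KiB pages: 16 MiB for ZONE_DMA and 4 GiB for ZONE_DMA32. The node-ID part of the change is not modelled here.

```c
#include <assert.h>

enum zone_type { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, NR_ZONES };

/*
 * Hypothetical helper: return the first zone whose upper PFN limit
 * (exclusive) lies above pfn, so a hole is linked to the zone that
 * would contain it instead of defaulting to zone 0.
 */
static enum zone_type zone_for_pfn(unsigned long pfn,
				   const unsigned long max_zone_pfn[NR_ZONES])
{
	for (int z = 0; z < NR_ZONES; z++) {
		if (pfn < max_zone_pfn[z])
			return (enum zone_type)z;
	}
	return (enum zone_type)(NR_ZONES - 1);	/* clamp to the highest zone */
}
```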
Link: https://lkml.kernel.org/r/20210111194017.22696-3-rppt@kernel.org
Fixes: 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: David Hildenbrand <david@redhat.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Mike Rapoport [Sun, 24 Jan 2021 05:00:57 +0000 (21:00 -0800)]
x86/setup: don't remove E820_TYPE_RAM for pfn 0
Patch series "mm: fix initialization of struct page for holes in memory layout", v3.
Commit 73a6e474cb37 ("mm: memmap_init: iterate over memblock regions
rather that check each PFN") exposed several issues with the memory map
initialization and these patches fix those issues.
Initially there were crashes during compaction that Qian Cai reported
back in April [1]. It seemed back then that the problem was fixed, but
a few weeks ago Andrea Arcangeli hit the same bug [2] and there was an
additional discussion at [3].
The first 4 KiB of memory is a BIOS-owned area, and to avoid its allocation
for the kernel it was not listed in e820 tables as memory. As a result,
pfn 0 was never recognised by the generic memory management and is
part of neither node 0 nor ZONE_DMA.
If set_pfnblock_flags_mask() were ever called for the pageblock
corresponding to the first 2 MiB of memory, having pfn 0 outside of
ZONE_DMA would trigger a VM_BUG_ON_PAGE() zone-span check.
Along with reserving the first 4Kb in e820 tables, the first several pages
are also reserved with memblock at several points during setup_arch(). These
reservations are enough to ensure the kernel does not touch the BIOS area
and it is not necessary to remove E820_TYPE_RAM for pfn 0.
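To make the pageblock arithmetic above concrete, here is a minimal userspace sketch. It assumes 4Kb pages and the 2Mbyte pageblock described above (512 pages, order 9). `struct zone_sketch` and `zone_spans_pfn_sketch()` are hypothetical stand-ins modeled on the idea behind the kernel's zone_spans_pfn() check, not the real kernel structures. It shows that a zone starting at pfn 1 still shares its first pageblock with pfn 0, which that zone does not span.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed geometry: 4Kb pages, 2Mbyte pageblocks (order 9). */
#define PAGE_SHIFT_SKETCH      12
#define PAGEBLOCK_ORDER_SKETCH 9
#define PAGEBLOCK_NR_PAGES     (1UL << PAGEBLOCK_ORDER_SKETCH) /* 512 pages */

/* Hypothetical model of a zone: the PFN range [start_pfn, start_pfn + spanned_pages). */
struct zone_sketch {
	unsigned long start_pfn;
	unsigned long spanned_pages;
};

/* Same idea as the kernel's zone_spans_pfn(): is @pfn inside the zone's span? */
static bool zone_spans_pfn_sketch(const struct zone_sketch *z, unsigned long pfn)
{
	return pfn >= z->start_pfn && pfn < z->start_pfn + z->spanned_pages;
}

/* Index of the pageblock that contains @pfn. */
static unsigned long pfn_to_pageblock(unsigned long pfn)
{
	return pfn >> PAGEBLOCK_ORDER_SKETCH;
}

/* First PFN of the pageblock that contains @pfn (round down to 512). */
static unsigned long pageblock_start_pfn(unsigned long pfn)
{
	return pfn & ~(PAGEBLOCK_NR_PAGES - 1);
}
```

For example, with a hypothetical `struct zone_sketch dma = { .start_pfn = 1, .spanned_pages = 4095 };`, `pageblock_start_pfn(1)` is 0 while `zone_spans_pfn_sketch(&dma, 0)` is false: the first pageblock straddles a PFN the zone does not own.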
Remove the update of e820 table that changes the type of pfn 0 and move
the comment describing why it was done to trim_low_memory_range() that
reserves the beginning of the memory.
Link: https://lkml.kernel.org/r/20210111194017.22696-2-rppt@kernel.org Signed-off-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: David Hildenbrand <david@redhat.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Jens Axboe [Sat, 23 Jan 2021 22:49:31 +0000 (15:49 -0700)]
io_uring: account io_uring internal files as REQ_F_INFLIGHT
We need to actively cancel anything that introduces a potential circular
loop, where io_uring holds a reference to itself. If the file in question
is an io_uring file, then add the request to the inflight list.
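A toy model of the idea in the paragraph above, with every name hypothetical (the real io_uring structures and list handling differ): a request whose file is itself an io_uring instance is flagged as in-flight and linked onto a per-context list, so that cancellation can find it and break the self-reference.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define REQ_F_INFLIGHT_SKETCH (1U << 0) /* hypothetical stand-in for REQ_F_INFLIGHT */

struct file_sketch {
	bool is_io_uring;                   /* true if this file is an io_uring instance */
};

struct req_sketch {
	struct file_sketch *file;
	unsigned int flags;
	struct req_sketch *next_inflight;   /* singly linked inflight list */
};

struct ctx_sketch {
	struct req_sketch *inflight_head;
};

/* Track a request if its file could make the ring hold a reference to itself. */
static void track_if_ring_file(struct ctx_sketch *ctx, struct req_sketch *req)
{
	if (req->file && req->file->is_io_uring &&
	    !(req->flags & REQ_F_INFLIGHT_SKETCH)) {
		req->flags |= REQ_F_INFLIGHT_SKETCH;
		req->next_inflight = ctx->inflight_head;
		ctx->inflight_head = req;
	}
}

/* "Cancel" every tracked request: unlink it, clear the flag, count it. */
static size_t cancel_inflight(struct ctx_sketch *ctx)
{
	size_t n = 0;

	while (ctx->inflight_head) {
		struct req_sketch *req = ctx->inflight_head;

		ctx->inflight_head = req->next_inflight;
		req->next_inflight = NULL;
		req->flags &= ~REQ_F_INFLIGHT_SKETCH;
		n++;
	}
	return n;
}
```

Only the request pointing at a ring file ends up on the list; a request on an ordinary file is left alone.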
Instead of cleaning files on overflow, move overflow cancellation back
into io_uring_cancel_files(). Previously, clearing the REQ_F_OVERFLOW flag
was racy, but that flag has since been removed, so this can now be done
through repeated attempts targeting all matching requests.
Reported-by: Abaci <abaci@linux.alibaba.com> Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com> Cc: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
Linus Torvalds [Sat, 23 Jan 2021 20:02:58 +0000 (12:02 -0800)]
Merge branch 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux
Pull mtd fixes from Miquel Raynal.
* 'mtd/fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux:
mtd: rawnand: omap: Use BCH private fields in the specific OOB layout
mtd: spinand: Fix MTD_OPS_AUTO_OOB requests
mtd: rawnand: intel: check the mtd name only after setting the variable
mtd: rawnand: nandsim: Fix the logic when selecting Hamming soft ECC engine
mtd: rawnand: gpmi: fix dst bit offset when extracting raw payload
Linus Torvalds [Sat, 23 Jan 2021 19:43:02 +0000 (11:43 -0800)]
Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux
Pull i2c fixes from Wolfram Sang:
"Another bunch of driver fixes"
* 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux:
i2c: sprd: depend on COMMON_CLK to fix compile tests
Revert "i2c: imx: Remove unused .id_table support"
i2c: octeon: check correct size of maximum RECV_LEN packet
i2c: tegra: Create i2c_writesl_vi() to use with VI I2C for filling TX FIFO
i2c: bpmp-tegra: Ignore unknown I2C_M flags
i2c: tegra: Wait for config load atomically while in ISR
Linus Torvalds [Sat, 23 Jan 2021 19:35:02 +0000 (11:35 -0800)]
Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI fixes from James Bottomley:
"Twelve minor fixes, all in drivers or doc.
Most of the fixes are pretty obvious (although we had two goes to get
the UFS sysfs doc right) and the biggest change is in the ufs driver
which they've extensively tested"
* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
scsi: ibmvfc: Set default timeout to avoid crash during migration
scsi: target: tcmu: Fix use-after-free of se_cmd->priv
scsi: fnic: Fix memleak in vnic_dev_init_devcmd2
scsi: libfc: Avoid invoking response handler twice if ep is already completed
scsi: scsi_transport_srp: Don't block target in failfast state
scsi: docs: ABI: sysfs-driver-ufs: Rectify table formatting
scsi: ufs: Fix tm request when non-fatal error happens
scsi: ufs: Fix livelock of ufshcd_clear_ua_wluns()
scsi: ibmvfc: Fix missing cast of ibmvfc_event pointer to u64 handle
scsi: ufs: ufshcd-pltfrm depends on HAS_IOMEM
scsi: megaraid_sas: Fix MEGASAS_IOC_FIRMWARE regression
scsi: docs: ABI: sysfs-driver-ufs: Add DeepSleep power mode
Linus Torvalds [Sat, 23 Jan 2021 19:25:33 +0000 (11:25 -0800)]
Merge tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest
Pull kunit fixes from Shuah:
"Five fixes to the kunit tool and documentation from Daniel Latypov and
David Gow"
* tag 'linux-kselftest-kunit-fixes-5.11-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
kunit: tool: move kunitconfig parsing into __init__, make it optional
kunit: tool: fix minor typing issue with None status
kunit: tool: surface and address more typing issues
Documentation: kunit: include example of a parameterized test
kunit: tool: Fix spelling of "diagnostic" in kunit_parser
The original intent of returning an error in this function
in the patch
"CIFS: Mask off signals when sending SMB packets"
was to avoid interrupting a packet send in the middle of
sending the data (and thus breaking an SMB connection),
but we also don't want to fail the request for non-fatal
signals before we have even had a chance to try to
send it (the reported problem could be reproduced, e.g.,
by exiting a child process while the parent process was in
the midst of calling futimens to update a file's timestamps).
In addition, since the signal may remain pending when we enter the
sending loop, we may end up not sending the whole packet before
TCP buffers become full. In this case the code returns -EINTR,
but what we need here is to return -ERESTARTSYS instead to
allow system calls to be restarted.
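The two rules described above can be modeled as a pair of small decision helpers. This is a userspace sketch with hypothetical function names, not the CIFS code: only a fatal signal aborts before anything is sent, and a pending signal that interrupts a partially sent packet yields -ERESTARTSYS rather than -EINTR. ERESTARTSYS is kernel-internal, so the sketch defines it locally; returning -ERESTARTSYS in the fatal pre-send case is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Kernel-internal errno value (include/linux/errno.h); not exported to userspace. */
#ifndef ERESTARTSYS
#define ERESTARTSYS 512
#endif

/* Before sending: non-fatal signals must not fail the request. */
static int signal_rc_before_send(bool fatal_signal_pending)
{
	/* Assumed return code for the fatal case; the process is exiting anyway. */
	return fatal_signal_pending ? -ERESTARTSYS : 0;
}

/*
 * After the send loop: only the error attributable to a pending signal is
 * modeled here; other failure paths are out of scope for this sketch.
 */
static int signal_rc_after_send(bool signal_pending, size_t sent, size_t total)
{
	/* Interrupted mid-packet: ask for a syscall restart instead of -EINTR. */
	if (signal_pending && sent != total)
		return -ERESTARTSYS;
	return 0;
}
```

Returning -ERESTARTSYS lets the signal-delivery path restart the system call transparently (for handlers installed with SA_RESTART), whereas -EINTR would surface the interruption to userspace.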
Fixes: b30c74c73c78 ("CIFS: Mask off signals when sending SMB packets") Cc: stable@vger.kernel.org # v5.1+ Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com> Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com> Signed-off-by: Steve French <stfrench@microsoft.com>
Linus Torvalds [Fri, 22 Jan 2021 22:31:00 +0000 (14:31 -0800)]
Merge tag 'for-5.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper fixes from Mike Snitzer:
- Fix DM integrity crash if "recalculate" used without "internal_hash"
- Fix DM integrity "recalculate" support to prevent recalculating
checksums if we use internal_hash or journal_hash with a key (e.g.
HMAC). Use of crypto as a means to prevent malicious corruption
requires further changes and was never a design goal for
dm-integrity's primary use case of detecting accidental corruption.
- Fix a benign dm-crypt copy-and-paste bug introduced as part of a fix
that was merged for 5.11-rc4.
- Fix DM core's dm_get_device() to avoid filesystem lookup to get block
device (if possible).
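The two dm-integrity "recalculate" fixes listed above can be read as option-validation rules. The sketch below is hypothetical (struct, enum, and function names are invented for illustration), and whether the real target rejects the table or silently ignores "recalculate" in each case is not stated in this pull message.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of parsed dm-integrity table options. */
struct integrity_opts_sketch {
	bool recalculate;          /* "recalculate" was requested */
	bool internal_hash;        /* an internal_hash algorithm was given */
	bool internal_hash_keyed;  /* internal_hash uses a key (e.g. HMAC) */
	bool journal_hash_keyed;   /* journal_hash uses a key */
};

enum recalc_decision { RECALC_OK, RECALC_REJECT, RECALC_DISABLED };

static enum recalc_decision check_recalculate(const struct integrity_opts_sketch *o)
{
	if (!o->recalculate)
		return RECALC_OK;        /* nothing to validate */
	if (!o->internal_hash)
		return RECALC_REJECT;    /* the combination that used to crash */
	if (o->internal_hash_keyed || o->journal_hash_keyed)
		return RECALC_DISABLED;  /* never recompute keyed checksums */
	return RECALC_OK;
}
```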
* tag 'for-5.11/dm-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm:
dm: avoid filesystem lookup in dm_get_dev_t()
dm crypt: fix copy and paste bug in crypt_alloc_req_aead
dm integrity: conditionally disable "recalculate" feature
dm integrity: fix a crash if "recalculate" used without "internal_hash"
Linus Torvalds [Fri, 22 Jan 2021 21:55:00 +0000 (13:55 -0800)]
Merge tag 'perf-tools-fixes-v5.11-2-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull more perf tools fixes from Arnaldo Carvalho de Melo:
- Fix id index used in Intel PT for heterogeneous systems
- Fix overrun issue in 'perf script' for dynamically-allocated PMU type
number
- Fix 'perf stat' metrics containing the 'duration_time' synthetic
event
- Fix system PMU 'perf stat' metrics
* tag 'perf-tools-fixes-v5.11-2-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf script: Fix overrun issue for dynamically-allocated PMU type number
perf metricgroup: Fix system PMU metrics
perf metricgroup: Fix for metrics containing duration_time
perf evlist: Fix id index for heterogeneous systems
Linus Torvalds [Fri, 22 Jan 2021 21:51:17 +0000 (13:51 -0800)]
Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
Pull arm64 fixes from Catalin Marinas:
- Correctly mask out bits 63:60 in a kernel tag check fault address
(specified as unknown by the architecture). Previously they were just
zeroed but for kernel pointers they need to be all ones.
- Fix a panic (unexpected kernel BRK exception) caused by kprobes being
reentered due to an interrupt.
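The first arm64 fix above amounts to a small piece of bit manipulation. The sketch below is illustrative, not the kernel's code. It assumes, as with arm64 top-byte-ignore, that bit 55 distinguishes kernel (1) from user (0) addresses. It discards the architecturally unknown bits 63:60, keeps the tag in bits 59:56, and fills bits 63:60 with ones only for kernel pointers.

```c
#include <assert.h>
#include <stdint.h>

#define TOP_NIBBLE_MASK (UINT64_C(0xF) << 60)  /* bits 63:60, reported as UNKNOWN */

/* Rebuild a usable fault address; bits 59:56 (the tag) are left untouched. */
static uint64_t fixup_fault_addr(uint64_t far)
{
	uint64_t addr = far & ~TOP_NIBBLE_MASK;     /* drop the unknown bits */

	if (addr & (UINT64_C(1) << 55))             /* assumed kernel-half selector */
		addr |= TOP_NIBBLE_MASK;            /* kernel pointers: all ones */
	return addr;
}
```

For instance, a kernel address reported as 0x0BFFFF8012345678 (tag 0xB) becomes 0xFBFFFF8012345678, while a user address keeps zeros in bits 63:60.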
* tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
arm64: kprobes: Fix Uexpected kernel BRK exception at EL1
kasan, arm64: fix pointer tags in KASAN reports
Linus Torvalds [Fri, 22 Jan 2021 21:47:25 +0000 (13:47 -0800)]
Merge tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client
Pull ceph fixes from Ilya Dryomov:
"A patch to zero out sensitive cryptographic data and two minor
cleanups prompted by the fact that a bunch of code was moved in this
cycle"
* tag 'ceph-for-5.11-rc5' of git://github.com/ceph/ceph-client:
libceph: fix "Boolean result is used in bitwise operation" warning
libceph, ceph: disambiguate ceph_connection_operations handlers
libceph: zero out session key and connection secret
Linus Torvalds [Fri, 22 Jan 2021 21:45:52 +0000 (13:45 -0800)]
Merge tag 'fixes-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock
Pull typo fix from Mike Rapoport:
"Fix typo in comment of memblock_phys_alloc_try_nid()"
* tag 'fixes-2021-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
mm/memblock: Fix typo in comment of memblock_phys_alloc_try_nid()
Linus Torvalds [Fri, 22 Jan 2021 21:38:40 +0000 (13:38 -0800)]
Merge tag 'platform-drivers-x86-v5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver fixes from Hans de Goede:
"A small collection of bug-fixes and model-specific quirks"
* tag 'platform-drivers-x86-v5.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86:
platform/x86: thinkpad_acpi: Add P53/73 firmware to fan_quirk_table for dual fan control
platform/x86: hp-wmi: Don't log a warning on HPWMI_RET_UNKNOWN_COMMAND errors
platform/x86: intel-vbtn: Drop HP Stream x360 Convertible PC 11 from allow-list
platform/x86: ideapad-laptop: Disable touchpad_switch for ELAN0634
platform/x86: amd-pmc: Fix CONFIG_DEBUG_FS check
platform/x86: thinkpad_acpi: correct palmsensor error checking
platform/x86: intel-vbtn: Support for tablet mode on Dell Inspiron 7352
platform/x86: touchscreen_dmi: Add swap-x-y quirk for Goodix touchscreen on Estar Beauty HD tablet
platform/x86: i2c-multi-instantiate: Don't create platform device for INT3515 ACPI nodes
platform/surface: SURFACE_PLATFORMS should depend on ACPI
platform/surface: surface_gpe: Fix non-PM_SLEEP build warnings
tools/power/x86/intel-speed-select: Set higher of cpuinfo_max_freq or base_frequency
tools/power/x86/intel-speed-select: Set scaling_max_freq to base_frequency
Pavel Begunkov [Thu, 21 Jan 2021 12:01:08 +0000 (12:01 +0000)]
io_uring: fix short read retries for non-reg files
Sockets and other non-regular files may legitimately return short reads,
so don't retry reads for them. Because non-regular files don't set
FMODE_BUF_RASYNC, they never get a second (retry) do_read, so those
cases can be filtered out after the first do_read() attempt returns ret > 0.
Jens Axboe [Tue, 19 Jan 2021 17:10:54 +0000 (10:10 -0700)]
io_uring: fix SQPOLL IORING_OP_CLOSE cancelation state
IORING_OP_CLOSE is special in terms of cancelation, since it has an
intermediate state where we've removed the file descriptor but haven't
closed the file yet. For that reason, it's currently marked with
IO_WQ_WORK_NO_CANCEL to prevent cancelation. This ensures that the op
is always run even if canceled, to prevent leaving us with a live file
but an fd that is gone. However, with SQPOLL, since a cancel request
doesn't carry any resources on behalf of the request being canceled, if
we cancel before any of the close op has been run, we can end up with
io-wq not having the ->files assigned. This can result in the following
oops reported by Joseph:
Fix this by deferring setting IO_WQ_WORK_NO_CANCEL until _after_ we've modified
the fdtable. Canceling before this point is totally fine, and running
it in the io-wq context _after_ that point is also fine.
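The ordering in the fix above can be modeled as a tiny state machine. All names are hypothetical stand-ins (`WORK_NO_CANCEL_SKETCH` plays the role of IO_WQ_WORK_NO_CANCEL). The no-cancel flag is set only once the fd has been removed from the table, so a cancel arriving earlier is still honored, and one arriving later cannot strand a live file behind a vanished fd.

```c
#include <assert.h>
#include <stdbool.h>

#define WORK_NO_CANCEL_SKETCH (1U << 0)  /* stand-in for IO_WQ_WORK_NO_CANCEL */

struct close_work_sketch {
	unsigned int flags;
	bool fd_removed;   /* fd already removed from the fdtable */
	bool file_closed;  /* underlying file actually closed */
};

/* Step 1: modify the fdtable; only from here on is cancellation unsafe. */
static void close_prep_sketch(struct close_work_sketch *w)
{
	w->fd_removed = true;
	w->flags |= WORK_NO_CANCEL_SKETCH;
}

/* Returns true if the work may be (and is) cancelled. */
static bool try_cancel_sketch(const struct close_work_sketch *w)
{
	return !(w->flags & WORK_NO_CANCEL_SKETCH);
}

/* Step 2: finish the close, e.g. later from io-wq context. */
static void close_finish_sketch(struct close_work_sketch *w)
{
	if (w->fd_removed)
		w->file_closed = true;
}
```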
For 5.12, we'll handle this internally and get rid of the no-cancel
flag, as IORING_OP_CLOSE is the only user of it.
Cc: stable@vger.kernel.org Fixes: b5dba59e0cf7 ("io_uring: add support for IORING_OP_CLOSE") Reported-by: "Abaci <abaci@linux.alibaba.com>" Reviewed-and-tested-by: Joseph Qi <joseph.qi@linux.alibaba.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>