btrfs: don't flush from btrfs_delayed_inode_reserve_metadata
Calling btrfs_qgroup_reserve_meta_prealloc from
btrfs_delayed_inode_reserve_metadata can result in flushing delalloc
while holding a transaction and delayed node locks. This is deadlock
prone. In the past multiple commits:

 * bc5090a9a896 ("btrfs: qgroup: don't try to wait flushing if we're
   already holding a transaction")

 * 0c4dca4b2e93 ("btrfs: qgroup: don't commit transaction when we already
   hold the handle")

tried to solve various aspects of this, but it was always a
whack-a-mole game. Unfortunately, those two fixes don't solve a deadlock
scenario involving btrfs_delayed_node::mutex. Namely, one thread
can call btrfs_dirty_inode as a result of reading a file and modifying
its atime:

  PID: 6963   TASK: ffff8c7f3f94c000  CPU: 2   COMMAND: "test"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_timeout at ffffffffa52a1bdd
  #3  wait_for_completion at ffffffffa529eeea             <-- sleeps with delayed node mutex held
  #4  start_delalloc_inodes at ffffffffc0380db5
  #5  btrfs_start_delalloc_snapshot at ffffffffc0393836
  #6  try_flush_qgroup at ffffffffc03f04b2
  #7  __btrfs_qgroup_reserve_meta at ffffffffc03f5bb6     <-- tries to reserve space and starts delalloc inodes.
  #8  btrfs_delayed_update_inode at ffffffffc03e31aa      <-- acquires delayed node mutex
  #9  btrfs_update_inode at ffffffffc0385ba8
  #10 btrfs_dirty_inode at ffffffffc038627b               <-- TRANSACTION OPENED
  #11 touch_atime at ffffffffa4cf0000
  #12 generic_file_read_iter at ffffffffa4c1f123
  #13 new_sync_read at ffffffffa4ccdc8a
  #14 vfs_read at ffffffffa4cd0849
  #15 ksys_read at ffffffffa4cd0bd1
  #16 do_syscall_64 at ffffffffa4a052eb
  #17 entry_SYSCALL_64_after_hwframe at ffffffffa540008c

This queues asynchronous work to flush the delalloc inodes, and that
work can try to acquire the same delayed_node mutex:

  PID: 455    TASK: ffff8c8085fa4000  CPU: 5   COMMAND: "kworker/u16:30"
  #0  __schedule at ffffffffa529e07d
  #1  schedule at ffffffffa529e4ff
  #2  schedule_preempt_disabled at ffffffffa529e80a
  #3  __mutex_lock at ffffffffa529fdcb                    <-- goes to sleep, never wakes up.
  #4  btrfs_delayed_update_inode at ffffffffc03e3143      <-- tries to acquire the mutex
  #5  btrfs_update_inode at ffffffffc0385ba8              <-- this is the same inode that pid 6963 is holding
  #6  cow_file_range_inline.constprop.78 at ffffffffc0386be7
  #7  cow_file_range at ffffffffc03879c1
  #8  btrfs_run_delalloc_range at ffffffffc038894c
  #9  writepage_delalloc at ffffffffc03a3c8f
  #10 __extent_writepage at ffffffffc03a4c01
  #11 extent_write_cache_pages at ffffffffc03a500b
  #12 extent_writepages at ffffffffc03a6de2
  #13 do_writepages at ffffffffa4c277eb
  #14 __filemap_fdatawrite_range at ffffffffa4c1e5bb
  #15 btrfs_run_delalloc_work at ffffffffc0380987         <-- starts running delayed nodes
  #16 normal_work_helper at ffffffffc03b706c
  #17 process_one_work at ffffffffa4aba4e4
  #18 worker_thread at ffffffffa4aba6fd
  #19 kthread at ffffffffa4ac0a3d
  #20 ret_from_fork at ffffffffa54001ff

To fully address those cases, the complete fix is to never issue any
flushing while holding the transaction or the delayed node lock. This
patch achieves it by calling qgroup_reserve_meta directly, which will
either succeed without flushing or fail and return -EDQUOT. In the
latter case the return value is propagated to btrfs_dirty_inode, which
will fall back to starting a new transaction. That's fine, as the
majority of the time we expect the inode to have the
BTRFS_DELAYED_NODE_INODE_DIRTY flag set, which results in directly
copying the in-memory state.
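
The "succeed without flushing or fail with -EDQUOT" behaviour can be
illustrated with a small userspace model. This is only a sketch and not
the kernel code: struct model_qgroup, model_reserve_noflush(),
model_dirty_inode() and enum model_path are hypothetical names invented
for illustration, and the accounting is reduced to a single byte counter
checked against a limit.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a qgroup's metadata accounting. */
struct model_qgroup {
	uint64_t limit;    /* maximum number of bytes that may be reserved */
	uint64_t reserved; /* bytes currently reserved */
};

/* Which path the (modelled) caller ends up taking. */
enum model_path { MODEL_FAST_PATH, MODEL_FALLBACK };

/*
 * Reserve @bytes without ever flushing: either the reservation fits
 * and 0 is returned, or nothing is changed and -EDQUOT is returned.
 * Because it never waits on other work, it is safe to call with a
 * transaction handle and a mutex held.
 */
static int model_reserve_noflush(struct model_qgroup *qg, uint64_t bytes)
{
	assert(qg->reserved <= qg->limit);

	if (bytes > qg->limit - qg->reserved)
		return -EDQUOT;

	qg->reserved += bytes;
	return 0;
}

/*
 * Modelled caller: -EDQUOT is simply propagated as "take the fallback",
 * where the real caller would start a new transaction instead of
 * flushing under its locks.
 */
static enum model_path model_dirty_inode(struct model_qgroup *qg,
					 uint64_t bytes)
{
	if (model_reserve_noflush(qg, bytes) == -EDQUOT)
		return MODEL_FALLBACK;

	return MODEL_FAST_PATH;
}
```

The point of the model is that the reservation helper has no path that
sleeps waiting for delalloc, so the cycle shown in the two backtraces
above cannot form; the cost is that the caller must handle -EDQUOT.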
Fixes: c00d6cc1c92a ("btrfs: qgroup: try to flush qgroup space when we get -EDQUOT")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>