which may fail for example (or other errors could happen). When the
failure occurs, btrfs_relocate_block_group() detects the error and frees
rc and doesn't set fs_info->reloc_ctl to NULL. After that, in
btrfs_init_reloc_root(), rc is retrieved from fs_info->reloc_ctl and
then used, which may cause a use-after-free bug.
This possible bug can be triggered by calling btrfs_ioctl_balance()
before calling btrfs_ioctl_defrag().
To fix this possible bug, in prepare_to_relocate(), check if
btrfs_commit_transaction() fails. If the failure occurs,
unset_reloc_control() is called to set fs_info->reloc_ctl to NULL.
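The ordering issue can be illustrated with a small user-space sketch. The structures and the commit_transaction() helper below are hypothetical stand-ins, not the kernel definitions; only the idea of clearing the published pointer before rc is freed comes from the description above.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel structures, not the real ones. */
struct reloc_control { int stage; };
struct fs_info { struct reloc_control *reloc_ctl; };

/* Simulated commit; a non-zero inject_failure mimics fault injection. */
static int commit_transaction(int inject_failure)
{
	return inject_failure ? -EIO : 0;
}

/* Plays the role of unset_reloc_control(): stop publishing rc. */
static void unset_reloc_control(struct fs_info *fs)
{
	fs->reloc_ctl = NULL;
}

static int prepare_to_relocate(struct fs_info *fs, struct reloc_control *rc,
			       int inject_failure)
{
	int ret;

	fs->reloc_ctl = rc;	/* rc becomes visible to other code paths */
	ret = commit_transaction(inject_failure);
	if (ret)
		unset_reloc_control(fs);	/* clear before the caller frees rc */
	return ret;
}

static int relocate_block_group(struct fs_info *fs, int inject_failure)
{
	struct reloc_control *rc = calloc(1, sizeof(*rc));
	int ret;

	if (!rc)
		return -ENOMEM;
	ret = prepare_to_relocate(fs, rc, inject_failure);
	if (!ret)
		unset_reloc_control(fs);	/* normal teardown in this sketch */
	free(rc);
	return ret;
}
```

With the pointer cleared on the error path, a later reader of fs->reloc_ctl sees NULL instead of a dangling pointer.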
The error log in our fault-injection testing is shown as follows:
Josef Bacik [Mon, 25 Jul 2022 17:05:05 +0000 (13:05 -0400)]
btrfs: reset RO counter on block group if we fail to relocate
With the automatic block group reclaim code we will preemptively try to
mark the block group RO before we start the relocation. We do this to
make sure we should actually try to relocate the block group.
However if we hit an error during the actual relocation we won't clean
up our RO counter and the block group will remain RO. This was observed
internally with file systems reporting less space available from df when
we had failed background relocations.
Fix this by doing the dec_ro in the error case.
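A minimal user-space sketch of the balanced counter follows. The struct and the helper names are hypothetical; leaving the counter raised on success is an assumption of the sketch (the relocated block group is expected to go away).

```c
#include <errno.h>

/* Hypothetical model of a block group with a read-only counter. */
struct block_group_sketch { int ro; };

static void inc_ro(struct block_group_sketch *bg) { bg->ro++; }
static void dec_ro(struct block_group_sketch *bg) { bg->ro--; }

/* Simulated relocation that can be forced to fail. */
static int relocate(int fail)
{
	return fail ? -EIO : 0;
}

static int reclaim_block_group(struct block_group_sketch *bg, int fail)
{
	int ret;

	inc_ro(bg);		/* preemptively mark the block group RO */
	ret = relocate(fail);
	if (ret)
		dec_ro(bg);	/* error case: undo the RO marking */
	return ret;
}
```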
Fixes: cd361b92e5fd ("btrfs: zoned: automatically reclaim zones")
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: don't call btrfs_page_set_checked in finish_compressed_bio_read
This flag was used to communicate that the low-level compression code
already did verify the checksum to the high-level I/O completion code.
But it has been unused for a long time as the upper btrfs_bio for the
decompressed data had a NULL csum pointer basically since that pointer
existed and the code already checks for that a little later.
Note that this does not affect the other use of the checked flag, which
is only used for the COW fixup worker.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently the checksum of compressed extents is verified based on the
compressed data and the lower btrfs_bio, but the actual repair process
is driven by end_bio_extent_readpage on the upper btrfs_bio for the
decompressed data.
This has a number of issues: the failed mirror cannot be properly
communicated upwards if the I/O submission got preempted, the distinction
between an I/O error and a checksum verification failure is lost, and,
most importantly, this design causes btrfs_clean_io_failure to eventually
write back the uncompressed good data onto the disk sectors that are
supposed to contain compressed data.
Fix this by moving the repair to the lower btrfs_bio. To do so, a fair
amount of code has to be reshuffled:
a) the lower btrfs_bio now needs a valid csum pointer. The easiest way
to achieve that is to pass NULL to btrfs_lookup_bio_sums and just use
the btrfs_bio management of csums. For a compressed_bio that is
split into multiple btrfs_bios this means additional memory
allocations, but the code becomes a lot more regular.
b) checksum verification now runs directly on the lower btrfs_bio instead
of the compressed_bio. This actually nicely simplifies the end I/O
processing.
c) btrfs_repair_one_sector can't just look up the logical address for
the file offset any more, as there are no corresponding relative
offsets that apply to the file offset and the logical address for
compressed extents. Instead require that the saved bvec_iter in the
btrfs_bio is filled out for all read bios and use that, which again
removes a fair amount of code.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the start argument to check_data_csum and export
Derive the value of start from the btrfs_bio now that ->file_offset is
always valid. Also export and rename the function so it's available
outside of inode.c as we'll need that soon.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: pass a btrfs_bio to btrfs_repair_one_sector
Pass the btrfs_bio instead of the plain bio to btrfs_repair_one_sector,
and remove the start and failed_mirror arguments in favor of deriving
them from the btrfs_bio. For this to work ensure that the file_offset
field is also initialized for buffered I/O.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify the pending I/O counting in struct compressed_bio
Instead of counting the sectors just count the bios, with an extra
reference held during submission. This significantly simplifies the
submission side error handling.
This slightly changes completion and error handling of
btrfs_submit_compressed_{read,write} because with the old code the
compressed_bio could have been completed in
submit_compressed_{read,write} only if there was an error during
submission for one of the lower bios, whilst with the new code there is a
chance for this to happen even for successful submission if all the
lower bios complete before the end of the function is reached.
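The counting scheme can be sketched in user space as follows. This is a single-threaded illustration using C11 atomics rather than the kernel's reference counting helpers; the struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a compressed_bio that counts pending lower bios. */
struct compressed_bio_sketch {
	atomic_int pending_ios;
	bool completed;
	int status;
};

static void cb_init(struct compressed_bio_sketch *cb)
{
	/* Start at 1: the extra reference held during submission. */
	atomic_init(&cb->pending_ios, 1);
	cb->completed = false;
	cb->status = 0;
}

/* Drop one reference; whoever drops the last one completes the bio. */
static void cb_put(struct compressed_bio_sketch *cb)
{
	if (atomic_fetch_sub(&cb->pending_ios, 1) == 1)
		cb->completed = true;
}

/* Called once for every lower bio that is submitted. */
static void cb_bio_submitted(struct compressed_bio_sketch *cb)
{
	atomic_fetch_add(&cb->pending_ios, 1);
}

/* Called from the end I/O handler of a lower bio. */
static void cb_bio_end_io(struct compressed_bio_sketch *cb, int error)
{
	if (error)
		cb->status = error;
	cb_put(cb);
}

/* Called at the end of submission to drop the extra reference. */
static void cb_submission_done(struct compressed_bio_sketch *cb)
{
	cb_put(cb);
}
```

If every lower bio ends before cb_submission_done() runs, the completion happens inside the submission function, which is the behavior change described above.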
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David Sterba <dsterba@suse.com>
When there is more than a single level of redundancy there can also be
multiple bad mirrors, and the current read repair code only repairs the
last bad one.
Restructure btrfs_repair_one_sector so that it records the originally
failed mirror and the number of copies, and then repair all known bad
copies until we reach the originally failed copy in clean_io_failure.
Note that this also means the read repair reads will always start from
the next bad mirror and not mirror 0.
This fixes btrfs/265 in xfstests.
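A rough sketch of the mirror walk, assuming 1-based mirror numbers that wrap around at the number of copies. The helper names and the repair_targets() function are illustrative and may not match the kernel code.

```c
/* Next mirror to try when reading, wrapping from num_copies back to 1. */
static int next_mirror(int cur_mirror, int num_copies)
{
	if (cur_mirror == num_copies)
		return 1;
	return cur_mirror + 1;
}

/* Previous mirror, wrapping from 1 back to num_copies. */
static int prev_mirror(int cur_mirror, int num_copies)
{
	if (cur_mirror == 1)
		return num_copies;
	return cur_mirror - 1;
}

/*
 * Collect the mirrors to rewrite once good_mirror returned good data:
 * walk backwards from the good copy until the originally failed mirror
 * has been included. The out array needs room for num_copies - 1
 * entries. Returns the number of entries written.
 */
static int repair_targets(int failed_mirror, int good_mirror,
			  int num_copies, int *out)
{
	int n = 0;
	int mirror = good_mirror;

	if (good_mirror == failed_mirror)
		return 0;	/* nothing was known to be bad */
	do {
		mirror = prev_mirror(mirror, num_copies);
		out[n++] = mirror;
	} while (mirror != failed_mirror);
	return n;
}
```

For example, with three copies, a failure on mirror 1, another on mirror 2 and good data from mirror 3, the walk yields mirrors 2 and 1 as repair targets.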
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: merge btrfs_dev_stat_print_on_error with its only caller
Fold it into the only caller.
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Sun, 17 Jul 2022 21:05:05 +0000 (22:05 +0100)]
btrfs: join running log transaction when logging new name
When logging a new name, in case of a rename, we pin the log before
changing it. We then either delete a directory entry from the log or
insert a key range item to mark the old name for deletion on log replay.
However when doing one of those log changes we may have another task that
started writing out the log (at btrfs_sync_log()) and it started before
we pinned the log root. So we may end up changing a log tree while its
writeback is being started by another task syncing the log. This can lead
to inconsistencies in a log tree and other unexpected results during log
replay, because we can get some committed node pointing to a node/leaf
that ends up not getting written to disk before the next log commit.
The problem, conceptually, started to happen in commit cd26ce2477c9c0
("btrfs: avoid logging all directory changes during renames"), because
there we started to update the log without joining its current transaction
first.
However the problem only became visible with commit 7e3525266208ee
("btrfs: stop doing unnecessary log updates during a rename"), and that is
because we used to pin the log at btrfs_rename() and then before entering
btrfs_log_new_name(), when unlinking the old dentry, we ended up at
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(). Both
of them join the current log transaction, effectively waiting for any log
transaction writeout (due to acquiring the root's log_mutex). This made it
safe even after leaving the current log transaction, because we remained
with the log pinned when we called btrfs_log_new_name().
Then in commit 7e3525266208ee ("btrfs: stop doing unnecessary log updates
during a rename"), we removed the log pinning from btrfs_rename() and
stopped calling btrfs_del_inode_ref_in_log() and
btrfs_del_dir_entries_in_log() during the rename, and started to do all
the needed work at btrfs_log_new_name(), but without joining the current
log transaction, only pinning the log, which is racy because another task
may have started writeout of the log tree right before we pinned the log.
Both commits landed in kernel 5.18, so it doesn't make any practical
difference which should be blamed, but I'm blaming the second commit only
because with the first one, by chance, the problem did not happen due to
the fact we joined the log transaction after pinning the log and unpinned
it only after calling btrfs_log_new_name().
So make btrfs_log_new_name() join the current log transaction instead of
pinning it, so that we never do log updates if its writeout is starting.
Fixes: 7e3525266208ee ("btrfs: stop doing unnecessary log updates during a rename")
CC: stable@vger.kernel.org # 5.18+
Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Tested-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 14 Jul 2022 10:48:10 +0000 (13:48 +0300)]
btrfs: simplify error handling in btrfs_lookup_dentry
In btrfs_lookup_dentry, releasing the reference of the sub_root and
running orphan cleanup should only happen if the dentry found actually
represents a subvolume. This can only be true in the 'else' branch:
otherwise either fixup_tree_root_location returned an ENOENT error, in
which case sub_root wouldn't have been changed, or we got a different
errno, which means btrfs_get_fs_root couldn't have executed successfully,
again meaning sub_root will be equal to root. So simplify all the branches
by moving the code into the 'else'.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 12 Jul 2022 15:31:22 +0000 (16:31 +0100)]
btrfs: send: always use the rbtree based inode ref management infrastructure
After the patch "btrfs: send: fix sending link commands for existing file
paths", we now have two infrastructures to detect and eliminate duplicated
inode references (due to names that got removed and re-added between the
send and parent snapshots):
1) One that works on a single inode ref/extref item;
2) A new one that works across all ref/extref items for an inode, and
it's also more efficient because even in the single ref/extref item
case, it does not do a linear search for all the names encoded in the
ref/extref item, it uses red black trees to speed up the search.
There's no good reason to keep both infrastructures, we can use the new
one everywhere, and it's always more efficient.
So remove the old infrastructure and change all sites that are using it
to use the new one.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: send: fix sending link commands for existing file paths
There is a bug sending link commands for existing file paths. When we're
processing an inode, we go over all references. All the new file paths are
added to the "new_refs" list. And all the deleted file paths are added to
the "deleted_refs" list. In the end, when we finish processing the inode,
we iterate over all the items in the "new_refs" list and send link commands
for those file paths. After that, we go over all the items in the
"deleted_refs" list and send unlink commands for them. If there are
duplicated file paths in both lists, we will try to create them before we
remove them. Then the receiver gets an -EEXIST error when trying the link
operations.
Example of having duplicated file paths in both lists:
$ btrfs subvolume create vol
# create a file and 2000 hard links to the same inode
$ touch vol/foo
$ for i in {1..2000}; do link vol/foo vol/$i ; done
# take a snapshot for a parent snapshot
$ btrfs subvolume snapshot -r vol snap1
# remove 2000 hard links and re-create the last 1000 links
$ for i in {1..2000}; do rm vol/$i; done;
$ for i in {1001..2000}; do link vol/foo vol/$i; done
# take another one for a send snapshot
$ btrfs subvolume snapshot -r vol snap2
$ mkdir receive_dir
$ btrfs send snap2 -p snap1 | btrfs receive receive_dir/
At subvol snap2
link 1238 -> foo
ERROR: link 1238 -> foo failed: File exists
In this case, we will have the same file paths added to both lists. In the
parent snapshot, reference paths {1..1237} are stored in inode references,
but reference paths {1238..2000} are stored in inode extended references.
In the send snapshot, all reference paths {1001..2000} are stored in inode
references. During the incremental send, we process their inode references
first. In record_changed_ref(), we iterate all its inode references in the
send/parent snapshot. For every inode reference, we also use find_iref() to
check whether the same file path also appears in the parent/send snapshot
or not. Inode references {1238..2000} which appear in the send snapshot but
not in the parent snapshot are added to the "new_refs" list. On the other
hand, inode references {1..1000} which appear in the parent snapshot but
not in the send snapshot are added to the "deleted_refs" list. Next, when
we process their inode extended references, reference paths {1238..2000}
are added to the "deleted_refs" list because all of them only appear in the
parent snapshot. Now two lists contain items as below:
"new_refs" list: {1238..2000}
"deleted_refs" list: {1..1000}, {1238..2000}
Reference paths {1238..2000} appear in both lists. And given the processing
order mentioned before, the receiver gets an -EEXIST error when trying
the link operations.
To fix the bug, the idea is to process the "deleted_refs" list before
the "new_refs" list. However, it's not easy to reshuffle the processing
order. For one thing, if we do so, we may unlink all the existing paths
first, leaving no valid path to link from. It is also inefficient
because we do a bunch of unlinks followed by links for the same paths.
Moreover, it makes little sense to have duplicates in both lists. A
reference path cannot be regarded as new if it has also been seen in
the past, or we wouldn't call it a new path. However, it's also not a good
idea to make find_iref() check a reference against all inode references
and all inode extended references because it may result in large disk
reads.
So we introduce two rbtrees to make the references easier for lookups.
And we also introduce record_new_ref_if_needed() and
record_deleted_ref_if_needed() for changed_ref() to check and remove
duplicated references early.
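The early de-duplication can be sketched as below. Small fixed-size arrays with linear search stand in for the two rbtrees, and references are keyed by a plain path string; both are simplifications made for the sketch, and the function bodies are assumptions rather than the kernel implementation.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_REFS 16

/* Simplified stand-in for one rbtree of recorded references. */
struct ref_set {
	const char *paths[MAX_REFS];
	int count;
};

static int ref_set_find(const struct ref_set *s, const char *path)
{
	for (int i = 0; i < s->count; i++)
		if (strcmp(s->paths[i], path) == 0)
			return i;
	return -1;
}

static bool ref_set_add(struct ref_set *s, const char *path)
{
	if (s->count == MAX_REFS)
		return false;
	s->paths[s->count++] = path;
	return true;
}

static void ref_set_remove_at(struct ref_set *s, int idx)
{
	s->paths[idx] = s->paths[--s->count];
}

/* Record 'path' in 'to' unless it cancels out an entry in 'other'. */
static bool record_if_needed(struct ref_set *to, struct ref_set *other,
			     const char *path)
{
	int idx = ref_set_find(other, path);

	if (idx >= 0) {
		/* Seen on the opposite side: the two records cancel out. */
		ref_set_remove_at(other, idx);
		return true;
	}
	if (ref_set_find(to, path) >= 0)
		return true;	/* already recorded */
	return ref_set_add(to, path);
}

static bool record_new_ref_if_needed(struct ref_set *new_refs,
				     struct ref_set *deleted_refs,
				     const char *path)
{
	return record_if_needed(new_refs, deleted_refs, path);
}

static bool record_deleted_ref_if_needed(struct ref_set *new_refs,
					 struct ref_set *deleted_refs,
					 const char *path)
{
	return record_if_needed(deleted_refs, new_refs, path);
}
```

A path that is recorded as deleted and later as new ends up in neither set, so no link or unlink command is generated for it.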
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: send: introduce recorded_ref_alloc and recorded_ref_free
Introduce wrappers to allocate and free recorded_ref structures.
Reviewed-by: Robbie Ko <robbieko@synology.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: BingJing Chang <bingjingc@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: wait until zone is finished when allocation didn't progress
When the allocated position doesn't progress, we cannot submit IOs to
finish a block group, but there should be ongoing IOs that will finish a
block group. So, in that case, we wait for a zone to be finished and retry
the allocation after that.
Introduce a new flag BTRFS_FS_NEED_ZONE_FINISH for fs_info->flags to
indicate that a zone must be finished before we can proceed. The flag is
set when the allocator detects it cannot activate a new block group, and
it is cleared once a zone is finished.
CC: stable@vger.kernel.org # 5.16+
Fixes: bdebb195b80b ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: write out partially allocated region
cow_file_range() works in an all-or-nothing way: if it fails to allocate an
extent for a part of the given region, it gives up the whole region, including
the successfully allocated parts. Built on cow_file_range(),
run_delalloc_zoned() writes data for the region only when it successfully
allocates the whole region.
This all-or-nothing allocation and write-out are problematic when available
space in all the block groups is getting tight under the active zone
restriction. btrfs_reserve_extent() tries hard to utilize the space left in
the active block groups, but finally gives up and fails with
-ENOSPC. However, if we send IOs for the successfully allocated region, we
can finish a zone and continue with the rest of the allocation on a newly
allocated block group.
This patch implements the partial write-out for run_delalloc_zoned(). With
this patch applied, cow_file_range() returns -EAGAIN to tell the caller to
do something to make further allocation progress, and reports the
successfully allocated region via done_offset. Furthermore, the zoned extent
allocator returns -EAGAIN to tell cow_file_range() to go back to the caller.
Actually, we still need to wait for an IO to complete to continue the
allocation. The next patch implements that part.
CC: stable@vger.kernel.org # 5.16+
Fixes: bdebb195b80b ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
There are two places where allocating a chunk is not enough. These two
places are trying to ensure the space by allocating a chunk. To meet the
condition for active_total_bytes, we also need to activate a block group
there.
CC: stable@vger.kernel.org # 5.16+
Fixes: bdebb195b80b ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: activate metadata block group on flush_space
For metadata space on zoned filesystem, reaching ALLOC_CHUNK{,_FORCE}
means we don't have enough space left in the active_total_bytes. Before
allocating a new chunk, we can try to activate an existing block group
in this case.
Also, allocating a chunk is not enough to grant a ticket for metadata
space on a zoned filesystem: we need to activate the block group to
increase the active_total_bytes.
btrfs_zoned_activate_one_bg() implements the activation feature. It will
activate a block group by (maybe) finishing a block group. It will give up
activating a block group if it cannot finish any block group.
CC: stable@vger.kernel.org # 5.16+
Fixes: bdebb195b80b ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: disable metadata overcommit for zoned
The metadata overcommit makes the space reservation flexible but it is also
harmful to active zone tracking. Since we cannot finish a block group from
the metadata allocation context, we might not activate a new block group
and might not be able to actually write out the overcommit reservations.
So, disable metadata overcommit for zoned filesystems. We will ensure
the reservations are under active_total_bytes in the following patches.
CC: stable@vger.kernel.org # 5.16+
Fixes: bdebb195b80b ("btrfs: zoned: implement active zone tracking")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The active_total_bytes, like the total_bytes, accounts for the total bytes
of active block groups in the space_info.
With the introduction of active_total_bytes, we can check if the reserved
bytes can be written to the block groups without activating a new block
group. The check is necessary for metadata allocation on a zoned
filesystem. We cannot finish a block group, which may require waiting
for the current transaction, from the metadata allocation context.
Instead, we need to ensure the ongoing allocation (reserved bytes) fits
in active block groups.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zoned: finish least available block group on data bg allocation
When we run out of active zones and there is not enough space left in any
block group, we need to finish one block group to make room to activate a
new block group.
However, we cannot do this for metadata block groups because we can cause a
deadlock by waiting for a running transaction commit. So, do that only for
a data block group.
Furthermore, the block group to be finished has two requirements. First,
the block group must not have reserved bytes left. Having reserved bytes
means we have an allocated region but did not yet send bios for it. If that
region is allocated by the thread calling btrfs_zone_finish(), it results
in a deadlock.
Second, the block group to be finished must not be a SYSTEM block
group. Finishing a SYSTEM block group easily breaks further chunk
allocation by nullifying the SYSTEM free space.
In certain cases, we cannot find any zone finish candidate, or
btrfs_zone_finish() may fail. In that case, we fall back to splitting the
allocation bytes and filling the last space left in the block groups.
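The candidate selection can be sketched as follows. The struct, its fields and pick_bg_to_finish() are hypothetical; the sketch only models the two requirements above plus the "least available" choice from the subject line.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an active block group. */
struct bg_sketch {
	uint64_t avail;		/* bytes still available in the zone */
	uint64_t reserved;	/* allocated, but bios not yet sent */
	bool is_system;		/* SYSTEM block group */
};

/*
 * Pick the active block group with the least available space, skipping
 * any that still has reserved bytes or is a SYSTEM block group.
 * Returns NULL when there is no candidate, in which case the caller
 * has to fall back to another strategy.
 */
static struct bg_sketch *pick_bg_to_finish(struct bg_sketch *bgs, size_t n)
{
	struct bg_sketch *min_bg = NULL;

	for (size_t i = 0; i < n; i++) {
		struct bg_sketch *bg = &bgs[i];

		if (bg->reserved > 0 || bg->is_system)
			continue;
		if (!min_bg || bg->avail < min_bg->avail)
			min_bg = bg;
	}
	return min_bg;
}
```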
CC: stable@vger.kernel.org # 5.16+
Fixes: bdebb195b80b ("btrfs: zoned: implement active zone tracking")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: use fs_info->max_extent_size in get_extent_max_capacity()
Use fs_info->max_extent_size also in get_extent_max_capacity() for
completeness. This is only used for defrag and is not strictly necessary to
fix the metadata reservation size, but it still suppresses unnecessary
defrag operations.
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: convert count_max_extents() to use fs_info->max_extent_size
If count_max_extents() uses BTRFS_MAX_EXTENT_SIZE to calculate the number
of extents needed, btrfs releases too much of the metadata reservation on
its way to writing out the data.
Now that BTRFS_MAX_EXTENT_SIZE is replaced with fs_info->max_extent_size,
convert count_max_extents() to use it instead, and fix the calculation of
the metadata reservation.
CC: stable@vger.kernel.org # 5.12+
Fixes: de7e061438a3 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: replace BTRFS_MAX_EXTENT_SIZE with fs_info->max_extent_size
On a zoned filesystem, data write-out is limited by max_zone_append_size,
and a large ordered extent is split according to the size of a bio. OTOH,
the number of extents to be written is calculated using
BTRFS_MAX_EXTENT_SIZE, and that estimated number is used to reserve the
metadata bytes to update and/or create the metadata items.
The metadata reservation is done at e.g. btrfs_buffered_write() and then
released as the estimation changes. Thus, if the number of extents
increases massively, the reserved metadata can run out.
The number of extents easily increases on a zoned filesystem
if BTRFS_MAX_EXTENT_SIZE > max_zone_append_size, and it causes the
following warning in a small RAM environment with metadata
over-commit disabled (in the following patch).
To fix the estimation, we need to introduce fs_info->max_extent_size to
replace BTRFS_MAX_EXTENT_SIZE, which allows setting a different size for
regular vs zoned filesystems.
Set fs_info->max_extent_size to BTRFS_MAX_EXTENT_SIZE by default. On zoned
filesystem, it is set to fs_info->max_zone_append_size.
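To illustrate how much the estimate can differ, here is a small user-space sketch. The fs_info_sketch struct and fs_info_init() are hypothetical, the 1 MiB zone append limit in the usage note is an arbitrary example value, and count_max_extents() is modelled as a plain ceiling division.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_128M			(128ULL * 1024 * 1024)
#define BTRFS_MAX_EXTENT_SIZE	SZ_128M

/* Hypothetical subset of fs_info, only what the sketch needs. */
struct fs_info_sketch {
	uint64_t max_zone_append_size;
	uint64_t max_extent_size;
};

/* Default to BTRFS_MAX_EXTENT_SIZE, use the append limit when zoned. */
static void fs_info_init(struct fs_info_sketch *fs, bool zoned,
			 uint64_t max_zone_append_size)
{
	fs->max_zone_append_size = max_zone_append_size;
	fs->max_extent_size = zoned ? max_zone_append_size
				    : BTRFS_MAX_EXTENT_SIZE;
}

/* Number of extents needed to cover 'size' bytes, rounded up. */
static uint64_t count_max_extents(const struct fs_info_sketch *fs,
				  uint64_t size)
{
	return (size + fs->max_extent_size - 1) / fs->max_extent_size;
}
```

For a 1 GiB range this gives 8 extents with the default limit, but 1024 extents with a 1 MiB zone append limit, which shows why an estimate based on BTRFS_MAX_EXTENT_SIZE reserves too little metadata on a zoned filesystem.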
CC: stable@vger.kernel.org # 5.12+
Fixes: de7e061438a3 ("btrfs: zoned: use ZONE_APPEND write for zoned mode")
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
This patch is basically a revert of commit 33944d434fb0 ("btrfs: zoned:
remove max_zone_append_size logic"), but without unnecessary ASSERT and
check. The max_zone_append_size will be used as a hint to estimate the
number of extents to cover delalloc/writeback region in the later commits.
The size of a ZONE APPEND bio is also limited by queue_max_segments(), so
this commit considers it to calculate max_zone_append_size. Technically, a
bio can be larger than queue_max_segments() * PAGE_SIZE if the pages are
contiguous. But, it is safe to consider "queue_max_segments() * PAGE_SIZE"
as an upper limit of an extent size to calculate the number of extents
needed to write data.
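The upper limit described above can be written down as a small calculation. This is a sketch with plain integer parameters instead of the block layer queue helpers; the 4 KiB page size and the function name are assumptions.

```c
#include <stdint.h>

#define SECTOR_SHIFT_SKETCH	9	/* 512-byte sectors */
#define PAGE_SHIFT_SKETCH	12	/* assumes 4 KiB pages */

/*
 * Take the smaller of the device's zone append limit (given in sectors)
 * and "max segments * PAGE_SIZE", both converted to bytes.
 */
static uint64_t max_zone_append_size(uint32_t max_append_sectors,
				     uint16_t max_segments)
{
	uint64_t by_append = (uint64_t)max_append_sectors << SECTOR_SHIFT_SKETCH;
	uint64_t by_segments = (uint64_t)max_segments << PAGE_SHIFT_SKETCH;

	return by_append < by_segments ? by_append : by_segments;
}
```

With a 1 MiB append limit (2048 sectors) and 128 segments, the segment count is the tighter bound and the result is 512 KiB.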
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 11 Jul 2022 14:22:50 +0000 (15:22 +0100)]
btrfs: add optimized btrfs_ino() version for 64 bits systems
Currently btrfs_ino() tries to use first the objectid of the inode's
location key. This is to avoid truncation of the inode number on 32 bits
platforms because the i_ino field of struct inode has the unsigned long
type, while the objectid is a 64 bits unsigned type (u64) on every system.
This logic was added in commit 4399f0d98ba53a ("Btrfs: Always use 64bit
inode number").
However if we are running on a 64 bits system, we can always directly
return the i_ino value from struct inode, which eliminates the need for
the special if statement that tests for a location key type of
BTRFS_ROOT_ITEM_KEY - in which case i_ino may not have the same value as
the objectid in the inode's location key, it may have a value of
BTRFS_EMPTY_SUBVOL_DIR_OBJECTID, for the case of snapshots of trees with
subvolumes/snapshots inside them.
So add a special version for 64 bits system that directly returns i_ino
of struct inode. This eliminates one branch and reduces the overall code
size, since btrfs_ino() is an inline function that is extensively used.
Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1617487  189240   29032 1835759  1c02ef fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1612028  189180   29032 1830240  1bed60 fs/btrfs/btrfs.ko
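The two variants can be sketched in user space as follows. The structs are simplified stand-ins for the kernel ones, and the word size is detected via ULONG_MAX here instead of the kernel's own configuration macros.

```c
#include <limits.h>
#include <stdint.h>

#define BTRFS_ROOT_ITEM_KEY 132

/* Simplified stand-ins for the kernel structures. */
struct key_sketch {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct inode_sketch {
	unsigned long i_ino;		/* may be only 32 bits wide */
	struct key_sketch location;
};

#if ULONG_MAX > 0xFFFFFFFFUL
/* 64 bits: i_ino can hold any inode number, so return it directly. */
static inline uint64_t btrfs_ino(const struct inode_sketch *inode)
{
	return inode->i_ino;
}
#else
/* 32 bits: prefer the 64-bit objectid to avoid truncation. */
static inline uint64_t btrfs_ino(const struct inode_sketch *inode)
{
	uint64_t ino = inode->location.objectid;

	/* A root item key marks a subvolume directory: use i_ino. */
	if (inode->location.type == BTRFS_ROOT_ITEM_KEY)
		ino = inode->i_ino;
	return ino;
}
#endif
```

Both variants return the same value whenever the inode number fits in an unsigned long, but the 64 bits one does so without a branch.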
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 11 Jul 2022 14:22:49 +0000 (15:22 +0100)]
btrfs: set the objectid of the btree inode's location key
We currently don't use the location key of the btree inode, its content
is set to zeroes, as it's a special inode that is not persisted (it has
no inode item stored in any btree).
At btrfs_ino(), an inline function used extensively in btrfs, we have
this special check if the given inode's location objectid is 0, and if it
is, we return the value stored in the VFS' inode i_ino field instead
(which is BTRFS_BTREE_INODE_OBJECTID for the btree inode).
To reduce the code at btrfs_ino(), we can simply set the objectid of the
btree inode to the value BTRFS_BTREE_INODE_OBJECTID. This eliminates the
need to check for the special case of the objectid being zero, with the
side effect of reducing the overall code size and having less code to
execute, as btrfs_ino() is an inline function.
Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1620502  189240   29032 1838774  1c0eb6 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1617487  189240   29032 1835759  1c02ef fs/btrfs/btrfs.ko
Reviewed-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: replace kmap_atomic() with kmap_local_page()
kmap_atomic() is being deprecated in favor of kmap_local_page() where it
is feasible. With kmap_local_page() mappings are per thread, CPU local,
and not globally visible.
The last use of kmap_atomic is in inode.c where the context is atomic [1]
and can be safely replaced by kmap_local_page.
Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB RAM and booting a
kernel with HIGHMEM64G enabled.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zlib: replace kmap() with kmap_local_page() in zlib_decompress_bio()
The use of kmap() is being deprecated in favor of kmap_local_page(). With
kmap_local_page(), the mapping is per thread, CPU local and not globally
visible.
Therefore, use kmap_local_page() / kunmap_local() in zlib_decompress_bio()
because in this function the mappings are per thread and are not visible
in other contexts.
Tested with xfstests on QEMU + KVM 32-bits VM with 4GB of RAM and
HIGHMEM64G enabled. This patch passes 26/26 tests of group "compress".
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zlib: replace kmap() with kmap_local_page() in zlib_compress_pages()
The use of kmap() is being deprecated in favor of kmap_local_page(). With
kmap_local_page(), the mapping is per thread, CPU local and not globally
visible.
Therefore, use kmap_local_page() / kunmap_local() in zlib_compress_pages()
because in this function the mappings are per thread and are not visible
in other contexts. Furthermore, drop the mappings of "out_page" which is
allocated within zlib_compress_pages() with alloc_page(GFP_NOFS) and use
page_address().
Tested with xfstests on a QEMU + KVM 32-bits VM with 4GB of RAM booting
a kernel with HIGHMEM64G enabled. This patch passes 26/26 tests of group
"compress".
CC: Qu Wenruo <wqu@suse.com>
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: zstd: replace kmap() with kmap_local_page()
The use of kmap() is being deprecated in favor of kmap_local_page(). With
kmap_local_page(), the mapping is per thread, CPU local and not globally
visible.
Therefore, use kmap_local_page() / kunmap_local() in zstd.c because in this
file the mappings are per thread and are not visible in other contexts.
Meanwhile, use plain page_address() on output pages allocated with
the GFP_NOFS flag instead of calling kmap*() on them (since they are
always allocated from ZONE_NORMAL).
Tested with xfstests on QEMU + KVM 32 bits VM with 4GB of RAM, booting a
kernel with HIGHMEM64G enabled.
Suggested-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
highmem: Make __kunmap_{local,atomic}() take const void pointer
__kunmap_{local,atomic}() currently take pointers to void. However, this
is semantically incorrect, since these functions do not change the memory
their arguments point to.
Therefore, make these semantics explicit by modifying the
__kunmap_{local,atomic}() prototypes to take pointers to const void.
As a side effect, compilers may produce more efficient code.
Acked-by: Andrew Morton <akpm@linux-foundation.org> Acked-by: Helge Deller <deller@gmx.de> # parisc Suggested-by: David Sterba <dsterba@suse.cz> Suggested-by: Ira Weiny <ira.weiny@intel.com> Reviewed-by: Ira Weiny <ira.weiny@intel.com> Signed-off-by: Fabio M. De Francesco <fmdefrancesco@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Mon, 4 Jul 2022 11:42:04 +0000 (12:42 +0100)]
btrfs: don't fallback to buffered IO for NOWAIT direct IO writes
Currently, for a direct IO write, if we need to fall back to buffered IO,
either to satisfy the whole write operation or just a part of it, we do
it in the current context even if it's a NOWAIT context. This is not ideal
because we currently don't have support for NOWAIT semantics in the
buffered IO path (we can block for several reasons), so we should instead
return -EAGAIN to the caller, so that it knows it should retry (the whole
operation or what's left of it) in a context where blocking is acceptable.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
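The decision described in this commit can be sketched as a small userspace model. This is not the kernel code: direct_write(), its parameters and the returned strings are hypothetical names chosen for illustration; only the -EAGAIN result for the NOWAIT case comes from the changelog above.

```python
import errno


def direct_write(nowait: bool, needs_buffered_fallback: bool):
    """Toy model of the path chosen for a direct IO write.

    Returns a negative errno on failure (kernel convention) or a string
    naming the path taken. All names here are illustrative only.
    """
    if not needs_buffered_fallback:
        return "direct"
    if nowait:
        # The buffered path may block, so a NOWAIT caller is told to retry
        # in a context where blocking is acceptable.
        return -errno.EAGAIN
    return "buffered"
```

A caller that gets -EAGAIN back is expected to resubmit the whole write, or what is left of it, from a context that may block.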
David Sterba [Thu, 23 Jun 2022 15:15:37 +0000 (17:15 +0200)]
btrfs: use enum for btrfs_block_rsv::type
The number of block group reserve types BTRFS_BLOCK_RSV_* is small and
fits in a u8, with enough values left in case we want to add more.
For type safety use the enum but make it 8 bits in the structure to save
space.
The structure size is now 48 on release build, making a slight
improvement in structures where it's embedded, like btrfs_fs_info or
btrfs_inode.
Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: do not return errors from btrfs_submit_dio_bio
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission and the other btrfs bio submission handlers do
and avoids any confusion on who needs to handle errors.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: handle allocation failure in btrfs_wq_submit_bio gracefully
btrfs_wq_submit_bio is used for writeback under memory pressure.
Instead of failing the I/O when we can't allocate the async_submit_bio,
just punt back to the synchronous submission path.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: simplify sync/async submission in btrfs_submit_data_write_bio
btrfs_submit_data_write_bio special cases the reloc root because the
checksums are preloaded, but only does so for the !sync case. The sync
case can't happen for data relocation, but just handling it more generally
significantly simplifies the logic.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: raid56: transfer the bio counter reference to the raid submission helpers
Transfer the bio counter reference acquired by btrfs_submit_bio to
raid56_parity_write and raid56_parity_recovery together with the bio
that the reference was acquired for instead of acquiring another
reference in those helpers and dropping the original one in
btrfs_submit_bio.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: do not return errors from raid56_parity_recover
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Also use the proper bool type for the generic_io argument.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: do not return errors from raid56_parity_write
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches what
the block layer submission does and avoids any confusion on who
needs to handle errors.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
Always consume the bio and call the end_io handler on error instead of
returning an error and letting the caller handle it. This matches
what the block layer submission does and avoids any confusion on who
needs to handle errors.
As this requires touching all the callers, rename the function to
btrfs_submit_bio, which describes the functionality much better.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Fri, 17 Jun 2022 10:04:06 +0000 (12:04 +0200)]
btrfs: return proper mapped length for RAID56 profiles in __btrfs_map_block()
For profiles other than RAID56, __btrfs_map_block() returns @map_length
as min(stripe_end, logical + *length), which is also the same result
from btrfs_get_io_geometry().
But for RAID56, __btrfs_map_block() returns @map_length as stripe_len.
This strange behavior is going to hurt incoming bio split at
btrfs_map_bio() time, as we will use @map_length as bio split size.
Fix this behavior by returning @map_length by the same calculation as
for other profiles.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
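As a rough illustration of the calculation named in this changelog, the sketch below clamps a request at the stripe boundary. It is a model under assumptions, not the kernel implementation: map_length() is a hypothetical helper, and treating the returned length as min(stripe_end, logical + length) minus logical is my reading of the text.

```python
def map_length(logical: int, length: int, stripe_end: int) -> int:
    """Bytes of [logical, logical + length) that lie below stripe_end.

    Assumes logical < stripe_end. Hypothetical helper for illustration.
    """
    mapped_end = min(stripe_end, logical + length)
    return mapped_end - logical


KIB = 1024
# A 128 KiB request starting 16 KiB into a 64 KiB stripe is clamped to
# the 48 KiB that remain before the stripe boundary.
clamped = map_length(16 * KIB, 128 * KIB, 64 * KIB)
```

With the fix, the RAID56 path reports a length computed this way instead of a fixed stripe_len, so a bio split based on it never crosses the stripe boundary.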
The raid56 code assumes a fixed stripe length BTRFS_STRIPE_LEN but there
are functions passing it as arguments, this is not necessary. The fixed
value has been used for a long time and though the stripe length should
be configurable by super block member stripesize, this hasn't been
implemented and would require more changes so we don't need to keep this
code around until then.
Partially based on a patch from Qu Wenruo.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Tested-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
[ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 6 Jul 2022 10:14:23 +0000 (11:14 +0100)]
btrfs: remove the inode cache check at btrfs_is_free_space_inode()
The inode cache feature was removed in kernel 5.11, and we no longer have
any code that reads from or writes to inode caches. We may still mount a
filesystem that has inode caches, but they are ignored.
Remove the check for an inode cache from btrfs_is_free_space_inode(),
since we no longer have code to trigger reads from an inode cache or
writes to an inode cache. The check at send.c is still needed, because
in case we find a filesystem with an inode cache, we must ignore it.
Also leave the checks at tree-checker.c, as they are sanity checks.
This eliminates a dead branch and reduces the amount of code since it's
in an inline function.
Before:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1620662  189240   29032 1838934  1c0f56 fs/btrfs/btrfs.ko
After:
$ size fs/btrfs/btrfs.ko
   text    data     bss     dec     hex filename
1620502  189240   29032 1838774  1c0eb6 fs/btrfs/btrfs.ko
Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 24 Jun 2022 08:01:23 +0000 (11:01 +0300)]
btrfs: sysfs: remove BIG_METADATA feature files
This flag has been merged in 3.10 and is effectively always-on. Its
status depends on the host page size so there's another way to guarantee
compatibility with old kernels.
Due to a bug introduced in 865daaf884dd ("btrfs: fix upper limit for
max_inline for page size 64K") the flag is not persisted among features
in the superblock so it's not reliable.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 24 Jun 2022 08:01:22 +0000 (11:01 +0300)]
btrfs: sysfs: remove MIXED_BACKREF feature file
This feature has been the default for about 13 years. At this point it's
safe to consider it an indispensable feature of BTRFS, so there's
no need to advertise it in sysfs. Remove the global sysfs feature file;
the per-filesystem feature file has never been there.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 23 Jun 2022 08:08:58 +0000 (11:08 +0300)]
btrfs: don't print 'has skinny extents' anymore on mount
Skinny extents have been a default mkfs feature since version 3.18
(introduced in btrfs-progs commit 6715de04d9a7 ("btrfs-progs: mkfs:
make skinny-metadata default")). The message really doesn't bring any
value to users, so simply remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 23 Jun 2022 07:57:52 +0000 (10:57 +0300)]
btrfs: don't print 'flagging with big metadata' anymore on mount
Added in commit 37cc17a2cba2 ("Btrfs: allow metadata blocks larger than
the page size") in 2010 and it's been default for mkfs since 3.12
(2013). The message doesn't really convey any useful information to
users. Remove it.
Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 21 Jun 2022 16:40:48 +0000 (18:40 +0200)]
btrfs: clean up chained assignments
The chained assignments may be convenient to write, but make readability
a bit worse as it's too easy to overlook that there are several values
set on the same line while this is rather an exception. Making it
consistent everywhere avoids surprises.
The pattern where inode times are initialized reuses the first value and
the order is mtime, ctime. In other blocks the assignments are expanded
so the order of variables is similar to the neighboring code.
David Sterba [Thu, 23 Jun 2022 14:57:02 +0000 (16:57 +0200)]
btrfs: merge calculations for simple striped profiles in btrfs_rmap_block
Use the same expression for stripe_nr for RAID0 (map->sub_stripes is 1)
and RAID10 (map->sub_stripes is 2), with equivalent results.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
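The equivalence claimed here can be checked with a short model. The expressions below are an assumed form of the calculation (multiply by the number of stripes, add the stripe index, divide by sub_stripes); the function names are hypothetical and the kernel source should be consulted for the exact code.

```python
def stripe_nr_raid0(stripe_nr: int, num_stripes: int, i: int) -> int:
    # Assumed RAID0-only form: no division needed.
    return stripe_nr * num_stripes + i


def stripe_nr_merged(stripe_nr: int, num_stripes: int,
                     sub_stripes: int, i: int) -> int:
    # Assumed merged form, shared by RAID0 (sub_stripes == 1) and
    # RAID10 (sub_stripes == 2).
    return (stripe_nr * num_stripes + i) // sub_stripes
```

Because dividing by 1 is a no-op, the merged expression gives the RAID0 result unchanged whenever sub_stripes is 1.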
David Sterba [Thu, 23 Jun 2022 14:57:00 +0000 (16:57 +0200)]
btrfs: use mask for all RAID1* profiles in btrfs_calc_avail_data_space
There's a sequence of hard coded values for RAID1 profiles that are
already stored in the raid_attr table that should be used instead.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 23 Jun 2022 07:55:47 +0000 (10:55 +0300)]
btrfs: properly flag filesystem with BTRFS_FEATURE_INCOMPAT_BIG_METADATA
Commit 865daaf884dd seemingly inadvertently moved the code responsible
for flagging the filesystem as having BIG_METADATA to a place where
setting the flag was essentially lost. This means that
filesystems created with kernels containing this bug (starting with 5.15)
can potentially be mounted by older (pre-3.4) kernels. In reality
chances for this happening are low because there are other incompat
flags introduced in the meantime. Still the correct behavior is to set
INCOMPAT_BIG_METADATA flag and persist this in the superblock.
Fixes: 865daaf884dd ("btrfs: fix upper limit for max_inline for page size 64K") CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 22 Jun 2022 18:45:18 +0000 (20:45 +0200)]
btrfs: print checksum type and implementation at mount time
Per user request, print the checksum type and implementation at mount
time among the messages. The checksum is user configurable and the
actual crypto implementation is useful to see for performance reasons.
The same information is also available after mount in the
/sys/fs/btrfs/FSID/checksum file.
Example:
[25.323662] BTRFS info (device vdb): using sha256 (sha256-generic) checksum algorithm
Link: https://github.com/kdave/btrfs-progs/issues/483 Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Mon, 13 Jun 2022 22:31:17 +0000 (18:31 -0400)]
btrfs: reset block group chunk force if we have to wait
If you try to force a chunk allocation, but you race with another chunk
allocation, you will end up waiting on the chunk allocation that just
occurred and then allocate another chunk. If you have many threads all
doing this at once you can way over-allocate chunks.
Fix this by resetting force to NO_FORCE, that way if we think we need to
allocate we can, otherwise we don't force another chunk allocation if
one is already happening.
Reviewed-by: Filipe Manana <fdmanana@suse.com> CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Wed, 18 May 2022 16:02:55 +0000 (18:02 +0200)]
btrfs: send: add new command FILEATTR for file attributes
There are file attributes inherited from previous ext2 SETFLAGS/GETFLAGS
and later from XFLAGS interfaces, now commonly found under the
'fileattr' API. This corresponds to the individual inode bits and that's
part of the on-disk format, so this is suitable for the protocol. The
other interfaces contain a lot of cruft or bits that btrfs does not
support yet.
Currently the value is u64 and matches btrfs_inode_item. Not all the
bits can be set by ioctls (like NODATASUM or READONLY), but we can send
them over the protocol and leave it up to the receiving side what and
how to apply.
As some of the flags, e.g. IMMUTABLE, can prevent any further changes,
the receiving side needs to understand that and apply the changes in the
right order, or possibly with some intermediate steps. This should be
easier, future proof and simpler on the protocol layer than implementing
in kernel.
Reviewed-by: Omar Sandoval <osandov@fb.com> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Tue, 17 May 2022 14:50:30 +0000 (16:50 +0200)]
btrfs: send: add OTIME as utimes attribute for proto 2+ by default
When send v1 was introduced the otime (inode creation time) was not
available, however the attribute in btrfs send protocol exists. Though
it would be possible to add it for v1 too as the attribute would be
ignored by v1 receive, let's not change the layout of v1 and only add
that to v2+. The otime cannot be changed and is only informative.
Naohiro Aota [Tue, 21 Jun 2022 06:41:02 +0000 (15:41 +0900)]
btrfs: replace unnecessary goto with direct return at cow_file_range()
The 'goto out' statements in the exit block of cow_file_range() are not
necessary and jump backwards. Replace them with return, while still
keeping 'goto out' in the main code.
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ keep goto in the main code, update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Tue, 21 Jun 2022 06:41:01 +0000 (15:41 +0900)]
btrfs: fix error handling of fallback uncompress write
When cow_file_range() fails in the middle of the allocation loop, it
unlocks the pages but leaves the ordered extents intact. Thus, we need
to call btrfs_cleanup_ordered_extents() to finish the created ordered
extents.
Also, we need to call end_extent_writepage() if locked_page is available
because btrfs_cleanup_ordered_extents() never processes the region on
the locked_page.
Furthermore, we need to set the mapping as error if locked_page is
unavailable before unlocking the pages, so that the errno is properly
propagated to the user space.
CC: stable@vger.kernel.org # 5.18+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Naohiro Aota [Tue, 21 Jun 2022 06:41:00 +0000 (15:41 +0900)]
btrfs: extend btrfs_cleanup_ordered_extents for NULL locked_page
btrfs_cleanup_ordered_extents() assumes locked_page to be non-NULL, so it
is not usable for submit_uncompressed_range() which can have NULL
locked_page.
Add support for the locked_page == NULL case. Also, rewrite the
redundant "page_offset(locked_page)".
Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
While debugging the issue, we found that running fstests generic/551 on
a 5GB non-zoned null_blk device in the emulated zoned mode also had a
similar hang.
Also, we can reproduce the same symptom with an error injected
cow_file_range() setup.
The hang occurs when cow_file_range() fails in the middle of
allocation. cow_file_range() called from do_allocation_zoned() can
split the given region ([start, end]) for allocation depending on
current block group usages. When btrfs can allocate bytes for one part
of the split regions but fails for the other region (e.g. because of
-ENOSPC), we return the error leaving the pages in the succeeded regions
locked. Technically, this occurs only when @unlock == 0. Otherwise, we
unlock the pages in an allocated region after creating an ordered
extent.
Considering that the callers of cow_file_range(unlock=0) won't write out
the pages, we can unlock the pages on error exit from
cow_file_range(). So, we can ensure all the pages except @locked_page
are unlocked in the error case.
In summary, cow_file_range now behaves like this:
- page_started == 1 (return value)
- All the pages are unlocked. IO is started.
- unlock == 1
- All the pages except @locked_page are unlocked in any case
- unlock == 0
- On success, all the pages are locked for writing them out
- On failure, all the pages except @locked_page are unlocked
Fixes: c67fd4a99ef6 ("btrfs: zoned: introduce dedicated data write path for zoned filesystems") CC: stable@vger.kernel.org # 5.12+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
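The behavior summary in this changelog can be restated as a small decision function. This is only a model of the listed cases: pages_locked_after() and its string results are hypothetical names, and page_started is treated here as an input although the changelog describes it as a return value.

```python
def pages_locked_after(page_started: bool, unlock: bool, success: bool) -> str:
    """Which pages remain locked when cow_file_range() returns.

    Returns "none", "locked_page_only" or "all". Illustrative model only.
    """
    if page_started:
        # All the pages are unlocked and IO has been started.
        return "none"
    if unlock:
        # Everything except @locked_page is unlocked, success or not.
        return "locked_page_only"
    # unlock == 0: pages stay locked only if the allocation succeeded.
    return "all" if success else "locked_page_only"
```

The fix concerns the last line: before it, a failure with unlock == 0 could leave the pages of already-allocated regions locked.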
The values are in one file so reading them at a single time will give a
more consistent view. The stats are internally tracked in nanoseconds so
the cumulative values should not suffer from rounding errors.
Writing 0 to the file 'commit_stats' will reset max_commit_ms.
Initial values are set at first mount of the filesystem.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com>
[ update changelog ] Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Track several stats about transaction commit, to be later exported via
sysfs:
- number of commits so far
- duration of the last commit in ns
- maximum commit duration seen so far in ns
- total duration for all commits so far in ns
The update of the commit stats occurs after the commit thread has gone
through all the logic that checks if there is another thread committing
at the same time. This means that we only account for actual commit work
in the commit stats we report and not the time the thread spends waiting
until it is ready to do the commit work.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Ioannis Angelakopoulos <iangelak@fb.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
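The four counters listed in this changelog can be modelled as follows. This is a sketch, not the kernel structure: the class and field names are hypothetical, and reset_max() stands in for writing 0 to the 'commit_stats' sysfs file described in the companion commit.

```python
class CommitStats:
    """Toy model of per-filesystem transaction commit statistics (in ns)."""

    def __init__(self) -> None:
        self.commit_count = 0
        self.last_commit_ns = 0
        self.max_commit_ns = 0
        self.total_commit_ns = 0

    def record(self, duration_ns: int) -> None:
        # Called once per commit, with the time spent on actual commit
        # work (waiting for another committer is not included).
        self.commit_count += 1
        self.last_commit_ns = duration_ns
        self.max_commit_ns = max(self.max_commit_ns, duration_ns)
        self.total_commit_ns += duration_ns

    def reset_max(self) -> None:
        self.max_commit_ns = 0

    def max_commit_ms(self) -> int:
        # Convert only when reporting, so the totals accumulate no
        # rounding error.
        return self.max_commit_ns // 1_000_000
```

Keeping nanoseconds internally and converting at read time is what lets the cumulative value stay exact.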
btrfs: remove extent writepage address space operation
Same as in commit 4a09a6b37802 ("xfs: drop ->writepage completely"): we
can remove the callback as it's only used in one place - single page
writeback from memory reclaim and is not called for cgroup writeback at
all.
We only allow such writeback from kswapd, not from direct memory
reclaim, and so it is rarely used. When it comes from kswapd, it is
effectively random dirty page shoot-down, which is horrible for IO
patterns. We can rely on background writeback to clean all dirty pages
in an efficient way and not let it be interrupted by kswapd.
Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
David Sterba [Thu, 2 Jun 2022 13:40:46 +0000 (15:40 +0200)]
btrfs: send: remove old TODO regarding ERESTARTSYS
The whole send operation is restartable and handling properly a buffer
write may not be easy. We can't know what caused that and if a short
delay and retry will fix it or how many retries should be performed in
case it's a temporary condition.
The error value is returned to the ioctl caller, so in case it's a
transient problem, the user would be notified about the reason. Remove
the TODO note as there's no plan to handle ERESTARTSYS.
btrfs: increase direct io read size limit to 256 sectors
Btrfs currently limits direct I/O reads to a single sector, which goes
back to commit 553f271606fc ("Btrfs: don't allocate a separate csums
array for direct reads") from Josef. That commit changes the direct I/O
code to ".. use the private part of the io_tree for our csums.", but ten
years later that isn't how checksums for direct reads work, instead they
use a csums allocation on a per-btrfs_dio_private basis (which have their
own performance problem for small I/O, but that will be addressed later).
There is no fundamental limit in btrfs itself to limit the I/O size
except for the size of the checksum array that scales linearly with
the number of sectors in an I/O. Pick a somewhat arbitrary limit of
256 sectors, which matches what buffered reads typically see as
the upper limit, as the limit for direct I/O as well.
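To make the linear scaling of the checksum array concrete, here is a small arithmetic sketch. The helper name is hypothetical, and the defaults (4096-byte sectors, 4-byte crc32c checksums) are assumptions about a typical filesystem rather than values taken from this changelog.

```python
def csum_array_bytes(io_bytes: int, sectorsize: int = 4096,
                     csum_size: int = 4) -> int:
    """Size of a checksum array holding one checksum per sector."""
    sectors = -(-io_bytes // sectorsize)  # round up to whole sectors
    return sectors * csum_size


# 256 sectors of 4 KiB is a 1 MiB read.
one_mib_read = 256 * 4096
crc32c_bytes = csum_array_bytes(one_mib_read)
sha256_bytes = csum_array_bytes(one_mib_read, csum_size=32)
```

Under these assumptions a 256-sector read needs a 1 KiB array for crc32c and 8 KiB for a 32-byte checksum such as sha256.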
This significantly improves direct read performance. For example a fio
run doing 1 MiB aio reads with a queue depth of 1 roughly triples the
throughput:
Before we enter raid56_parity_recover(), we have triggered some metadata
write for the full stripe 38928384; this leads us to read all the
sectors from disk.
Furthermore, btrfs raid56 write will cache its calculated P/Q sectors to
avoid unnecessary read.
This means, for that full stripe, after any partial write, we will have
stale data, along with P/Q calculated using that stale data.
Thankfully due to patch "btrfs: only write the sectors in the vertical stripe
which has data stripes" we haven't submitted all the corrupted P/Q to disk.
When we really need to recover a certain range, i.e. in
raid56_parity_recover(), we will use the cached rbio, along with its
cached sectors (the full stripe is all cached).
This explains why we have no event raid56_scrub_read_recover()
triggered.
Since we have the cached P/Q which is calculated using the stale data,
the recovered one will just be stale.
In our particular test case, it will always return the same incorrect
metadata, thus causing the same error message "parent transid verify
failed on 39010304 wanted 9 found 7" again and again.
[BTRFS DESTRUCTIVE RMW PROBLEM]
Test case btrfs/125 (and above workload) always has its trouble with
the destructive read-modify-write (RMW) cycle:
          0       32K     64K
Data1:    | Good  | Good  |
Data2:    | Bad   | Bad   |
Parity:   | Good  | Good  |
In the above case, if we trigger any write into Data1, we will use the bad
data in Data2 to re-generate parity, killing the only chance to recover
Data2, thus Data2 is lost forever.
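The effect can be reproduced with plain XOR parity on two data stripes. This is a toy RAID5-style model with made-up byte values, not btrfs code; all names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (single-parity calculation)."""
    return bytes(x ^ y for x, y in zip(a, b))


data1 = b"\x11" * 4                      # good on disk
data2_good = b"\x22" * 4                 # what Data2 should contain
parity = xor_bytes(data1, data2_good)    # on-disk parity is still good
data2_on_disk = b"\xee" * 4              # stale/bad copy actually on disk

# Before any write, Data2 can be rebuilt from Data1 and parity.
recovered_before = xor_bytes(data1, parity)

# A partial write to Data1 re-generates parity from the bad Data2.
data1_new = b"\x33" * 4
parity_after_rmw = xor_bytes(data1_new, data2_on_disk)

# Recovery now only reproduces the bad data.
recovered_after = xor_bytes(data1_new, parity_after_rmw)
```

Here recovered_before equals data2_good, while recovered_after equals data2_on_disk: once parity has been recomputed from the bad stripe, the good contents are no longer derivable.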
This destructive RMW cycle is not specific to btrfs RAID56, but there
are some btrfs specific behaviors making the case even worse:
- Btrfs will cache sectors for unrelated vertical stripes.
In above example, if we're only writing into 0~32K range, btrfs will
still read data range (32K ~ 64K) of Data1, and (64K~128K) of Data2.
This behavior is to cache sectors for later update.
Incidentally, commit ba855191fe99 ("btrfs: raid56: make steal_rbio()
subpage compatible") has a bug which makes RAID56 never trust the
cached sectors, thus slightly improving the situation for recovery.
Unfortunately, the follow-up fix "btrfs: update stripe_sectors::uptodate in
steal_rbio" will revert the behavior back to the old one.
- Btrfs raid56 partial write will update all P/Q sectors and cache them
This means that even if the data at (64K ~ 96K) of Data2 is free space and
only (96K ~ 128K) of Data2 is really stale data, a write into that
(96K ~ 128K) range will update all the parity sectors for the full
stripe.
This unnecessary behavior will completely kill the chance of recovery.
Thankfully, an unrelated optimization "btrfs: only write the sectors
in the vertical stripe which has data stripes" will prevent
submitting the write bio for untouched vertical sectors.
That optimization will keep the on-disk P/Q untouched for a chance for
later recovery.
[FIX]
Although we have no good way to completely fix the destructive RMW
(unless we go full scrub for each partial write), we can still limit the
damage.
With patch "btrfs: only write the sectors in the vertical stripe which
has data stripes" now we won't really submit the P/Q of unrelated
vertical stripes, so the on-disk P/Q should still be fine.
Now all we really need to do is drop all the cached sectors when doing
recovery.
By this, we have a chance to read the original P/Q from disk, and have a
chance to recover the stale data, while still keeping the cache to speed
up the regular write path.
In fact, just dropping all the cache for recovery path is good enough to
allow the test case btrfs/125 along with the small script to pass
reliably.
The lack of metadata write after the degraded mount, and forced metadata
COW is saving us this time.
So this patch will fix the behavior by not trusting any cache in
__raid56_parity_recover(), to solve the problem while still keeping the
cache useful.
But please note that this test pass DOES NOT mean we have solved the
destructive RMW problem, we just do damage control a little
better.
Related patches:
- btrfs: only write the sectors in the vertical stripe
- ba855191fe99 ("btrfs: raid56: make steal_rbio() subpage compatible")
- btrfs: update stripe_sectors::uptodate in steal_rbio
Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
btrfs: remove the finish_func argument to btrfs_mark_ordered_io_finished
finish_func is always set to finish_ordered_fn, so remove it and also
the now pointless and somewhat confusingly named
__endio_write_update_ordered wrapper.
Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Fri, 17 Jun 2022 12:53:34 +0000 (15:53 +0300)]
btrfs: batch up release of reserved metadata for delayed items used for deletion
With Filipe's recent rework of the delayed inode code one aspect which
isn't batched is the release of the reserved metadata of delayed inode's
delete items. With this patch on top of Filipe's rework and running the
same test as provided in the description of a patch titled
"btrfs: improve batch deletion of delayed dir index items" I observe
the following change of the number of calls to btrfs_block_rsv_release:
Before this change:
- block_rsv_release: 1004
- btrfs_delete_delayed_items_total_time: 14602
- delete_batches: 505
Qu Wenruo [Mon, 13 Jun 2022 07:06:35 +0000 (15:06 +0800)]
btrfs: warn about dev extents that are inside the reserved range
Btrfs on-disk format has reserved the first 1MiB for the primary super
block (at 64KiB offset) and bootloaders may also use this space.
This reservation was only introduced in the v4.1 btrfs-progs release.
Although the kernel can ensure we never touch the reserved range of super
blocks, it's better to inform the end users; a balance will resolve
the problem.
Signed-off-by: Qu Wenruo <wqu@suse.com>
[ update changelog and message ] Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Mon, 13 Jun 2022 07:06:34 +0000 (15:06 +0800)]
btrfs: use named constant for reserved device space
There's a reserved space on each device of size 1MiB that can be used by
bootloaders or to avoid accidental overwrite. Use a symbolic constant
with the explaining comment instead of hard coding the value and
multiple comments.
Note: since btrfs-progs v4.1, mkfs.btrfs will reserve the first 1MiB for
the primary super block (at offset 64KiB), until then the range could
have been used by mistake. The kernel has always respected the 1MiB
range for writes.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com>
[ update changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
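The check implied by the reserved range can be sketched in a few lines. The constant and function names below are hypothetical stand-ins chosen for illustration; only the 1MiB size and the 64KiB super block offset come from the changelogs above.

```python
# Hypothetical name for the 1MiB reserved at the start of each device.
DEVICE_RESERVED_BYTES = 1024 * 1024
PRIMARY_SUPER_OFFSET = 64 * 1024


def dev_extent_in_reserved_range(physical_start: int) -> bool:
    """True if a device extent begins inside the reserved first 1MiB."""
    return physical_start < DEVICE_RESERVED_BYTES
```

A named constant keeps such comparisons readable and puts the explanation of the reserved range in a single place.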
David Sterba [Mon, 6 Jun 2022 17:32:59 +0000 (19:32 +0200)]
btrfs: sink iterator parameter to btrfs_ioctl_logical_to_ino
There's only one function we pass to iterate_inodes_from_logical as
iterator, so we can drop the indirection and call it directly, after
moving the function to backref.c
David Sterba [Mon, 6 Jun 2022 17:06:17 +0000 (19:06 +0200)]
btrfs: simplify parameters of backref iterators
The inode reference iterator interface takes parameters that are derived
from the context parameter, but as it's a void* type the values are
passed individually.
Change the ctx type to inode_fs_path as it's the only thing we pass and
drop any parameters that are derived from that.
David Sterba [Mon, 6 Jun 2022 16:52:24 +0000 (18:52 +0200)]
btrfs: call inode_to_path directly and drop indirection
The functions for iterating inode references take a function parameter
but there's only one value, inode_to_path(). Remove the indirection and
call the function. As paths_from_inode would become just an alias for
iterate_irefs(), merge the two into one function.
Qu Wenruo [Fri, 13 May 2022 08:34:31 +0000 (16:34 +0800)]
btrfs: use ncopies from btrfs_raid_array in btrfs_num_copies()
For all non-RAID56 profiles, we can use btrfs_raid_array[].ncopies
directly, only for RAID5 and RAID6 we need some extra handling as
there's no table value for that.
For RAID10 there's a change from sub_stripes to ncopies. The values are
the same but semantically we want to use number of copies, as this is
what btrfs_num_copies does.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
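A table-driven lookup of this kind might look like the sketch below. The table values and the RAID5/RAID6 special cases reflect my understanding of the kernel at the time and should be treated as assumptions; the dictionary and function names are hypothetical.

```python
# Assumed per-profile copy counts, mirroring the idea of
# btrfs_raid_array[].ncopies for the non-RAID56 profiles.
NCOPIES = {
    "single": 1, "dup": 2, "raid0": 1, "raid10": 2,
    "raid1": 2, "raid1c3": 3, "raid1c4": 4,
}


def num_copies(profile: str, num_stripes: int) -> int:
    """Number of ways a block of the given profile can be read."""
    if profile in NCOPIES:
        return NCOPIES[profile]
    if profile == "raid5":
        return 2                # assumed: original plus one rebuild
    if profile == "raid6":
        return num_stripes      # assumed: retry once per stripe
    raise ValueError(f"unknown profile: {profile}")
```

For RAID10 the result is 2 whether one reads sub_stripes or ncopies, which is why the change is purely semantic for that profile.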
Qu Wenruo [Fri, 13 May 2022 08:34:30 +0000 (16:34 +0800)]
btrfs: use btrfs_raid_array to calculate number of parity stripes
Use the raid table instead of hard coded values and rename the helper as
it is exported. This could make later extension on RAID56 based
profiles easier.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
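As a rough illustration of replacing hard-coded values with a table lookup, assuming a per-profile nparity field: the enum, table, and helper name below are hypothetical and not the kernel's definitions.

```c
/* Hypothetical profile list for this sketch only. */
enum parity_profile {
	PP_SINGLE,
	PP_RAID1,
	PP_RAID5,
	PP_RAID6,
	PP_COUNT,
};

/* Assumed per-profile parity counts. */
static const int nparity_table[PP_COUNT] = {
	[PP_SINGLE] = 0,
	[PP_RAID1]  = 0,
	[PP_RAID5]  = 1,
	[PP_RAID6]  = 2,
};

/* Table lookup instead of "if RAID5 return 1, if RAID6 return 2". */
static int nr_parity_stripes(enum parity_profile profile)
{
	return nparity_table[profile];
}
```

With this shape, a new parity-based profile would need only a new table row rather than another branch in every caller.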
Qu Wenruo [Fri, 13 May 2022 08:34:28 +0000 (16:34 +0800)]
btrfs: remove parameter dev_extent_len from scrub_stripe()
For scrub_stripe() we can easily calculate the dev extent length as we
have the full info of the chunk.
Thus there is no need to pass @dev_extent_len from the caller, and we
introduce a helper, btrfs_calc_stripe_length(), to do the calculation
from extent_map structure.
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
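The calculation can be sketched as follows, assuming the per-device extent length is the chunk length divided by the number of data stripes. struct chunk_info and calc_stripe_length() are hypothetical stand-ins for the extent_map data and the real helper.

```c
#include <stdint.h>

/* Hypothetical summary of what the chunk mapping provides. */
struct chunk_info {
	uint64_t len;    /* logical length of the chunk in bytes */
	int num_stripes; /* total stripes, including copies and parity */
	int ncopies;     /* copies of each data stripe */
	int nparity;     /* parity stripes (0 for non-RAID56) */
};

/* Assumed formula: data stripes = (num_stripes - nparity) / ncopies. */
static uint64_t calc_stripe_length(const struct chunk_info *ci)
{
	const int data_stripes = (ci->num_stripes - ci->nparity) / ci->ncopies;

	return ci->len / (uint64_t)data_stripes;
}
```

For example, a 2 GiB RAID10 chunk over four stripes with two copies has two data stripes, so each device extent would be 1 GiB under this formula.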
David Sterba [Thu, 25 Jun 2020 17:03:41 +0000 (19:03 +0200)]
btrfs: unify tree search helper returning prev and next nodes
Simplify the helper to return only the next and prev pointers. We don't
need all the node/parent/prev/next pointers of __etree_search as there
are now other specialized helpers. Rename parameters so they follow the
naming.
David Sterba [Thu, 25 Jun 2020 16:49:39 +0000 (18:49 +0200)]
btrfs: make tree search for insert more generic and use it for tree_search
With a slight extension of tree_search_for_insert (fill the return node
and parent return parameters) we can avoid calling __etree_search from
tree_search, which could eventually be removed in follow-up patches.
David Sterba [Thu, 25 Jun 2020 16:35:24 +0000 (18:35 +0200)]
btrfs: open code inexact rbtree search in tree_search
The call chain from

  tree_search
    tree_search_for_insert
      __etree_search

can be open coded, allowing further simplifications. Here we need a tree
search with fallback to the next node in case it's not found. This is
represented as __etree_search parameters next_ret=valid, prev_ret=NULL.
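A search with fallback to the next node can be sketched on a plain binary search tree of non-overlapping ranges. This is not the kernel's rbtree code: the node type and function are hypothetical, and instead of walking rb_next() the sketch remembers the last node where the walk turned left, which is the in-order successor when no range contains the offset.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: an inclusive [start, end] range in a plain BST. */
struct range_node {
	uint64_t start;
	uint64_t end;
	struct range_node *left;
	struct range_node *right;
};

/*
 * Return the node containing @offset, or the first node that starts
 * after @offset, or NULL if there is no such node.
 */
static struct range_node *search_or_next(struct range_node *root,
					 uint64_t offset)
{
	struct range_node *next = NULL;
	struct range_node *node = root;

	while (node) {
		if (offset < node->start) {
			/* This node lies after @offset; best fallback so far. */
			next = node;
			node = node->left;
		} else if (offset > node->end) {
			node = node->right;
		} else {
			return node;
		}
	}
	return next;
}
```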
David Sterba [Thu, 25 Jun 2020 16:11:31 +0000 (18:11 +0200)]
btrfs: add fast path for extent_state insertion
In two cases the exact location where to insert the extent state is
known at call time, so we don't need to pass it to insert_state just to
make it take the fast path.
David Sterba [Thu, 25 Jun 2020 15:54:54 +0000 (17:54 +0200)]
btrfs: pass bits by value not by pointer for extent_state helpers
The bits are passed by pointer to all extent state helpers for no
apparent reason. The value is only read and never updated, so remove the
indirection and pass it directly. Also unify the type to u32 where
needed.
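The change in calling convention can be illustrated with a hypothetical helper pair; struct state and both function names are made up for this sketch and are not the extent_state API.

```c
#include <stdint.h>

/* Hypothetical stand-in for an object carrying state bits. */
struct state {
	uint32_t bits;
};

/* Before: bits go through a pointer even though they are only read. */
static void set_bits_by_pointer(struct state *s, unsigned int *bits)
{
	s->bits |= *bits;
}

/* After: bits are passed by value with a fixed-width type. */
static void set_bits_by_value(struct state *s, uint32_t bits)
{
	s->bits |= bits;
}
```

Both helpers have the same effect; the by-value form makes it obvious at the call site that the caller's bits cannot be modified.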
Qu Wenruo [Thu, 2 Jun 2022 07:51:19 +0000 (15:51 +0800)]
btrfs: raid56: avoid double for loop inside __raid56_parity_recover()
The double for loop can easily be converted to a single for loop, as we're
really iterating the sectors in their bytenr order.
The only exception is the full stripe skip, however that can also easily
be done inside the loop. Add an ASSERT() along with a comment for that
specific case.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Qu Wenruo [Thu, 2 Jun 2022 07:51:18 +0000 (15:51 +0800)]
btrfs: raid56: avoid double for loop inside finish_rmw()
We can easily calculate the stripe number and sector number inside the
stripe. Thus there is not much need for a double for loop.
For the only case we want to skip the whole stripe, we can manually
increase @total_sector_nr.
This is not a recommended behavior, thus every time the iterator gets
modified there will be a comment along with an ASSERT() for it.
Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
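The single-loop pattern with a whole-stripe skip might be sketched like this. The function and its parameters are hypothetical; only the idea of deriving the stripe and sector numbers from one counter, and of guarding the manual counter increase with an assertion, is taken from the description above.

```c
#include <assert.h>

/* Count sectors visited when one whole stripe (@skip_stripe) is skipped. */
static int count_visited_sectors(int nr_stripes, int stripe_nsectors,
				 int skip_stripe)
{
	const int nr_sectors = nr_stripes * stripe_nsectors;
	int visited = 0;

	for (int total_sector_nr = 0; total_sector_nr < nr_sectors;
	     total_sector_nr++) {
		const int stripe = total_sector_nr / stripe_nsectors;
		const int sectornr = total_sector_nr % stripe_nsectors;

		if (stripe == skip_stripe) {
			/* The skip must start at the first sector of a stripe. */
			assert(sectornr == 0);
			/*
			 * Jump to the last sector of this stripe; the loop
			 * increment then moves on to the next stripe.
			 */
			total_sector_nr += stripe_nsectors - 1;
			continue;
		}
		visited++;
	}
	return visited;
}
```

With 3 stripes of 16 sectors and stripe 1 skipped, the loop body runs for the 32 sectors of stripes 0 and 2.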