David Sterba [Tue, 1 Feb 2022 14:42:07 +0000 (15:42 +0100)]
btrfs: replace BUILD_BUG_ON by static_assert
The static_assert introduced in 6bab69c65013 ("build_bug.h: add wrapper
for _Static_assert") has been supported by compilers for a long time
(gcc 4.6, clang 3.0) and can be used in header files. We don't need to
put BUILD_BUG_ON to random functions but rather keep it next to the
definition.
The exception here is the UAPI header btrfs_tree.h that could be
potentially included by userspace code and the static assert is not
defined (nor used in any other header).
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Allow creating or reading block-groups on a zoned device with DUP as a
meta-data profile.
This works because we're using the zoned_meta_io_lock and REQ_OP_WRITE
operations for meta-data on zoned btrfs, so all writes to meta-data zones
are aligned to the zone's write-pointer.
Upon loading of the block-group, it is ensured both zones do have the same
zone capacity and write-pointer offsets, so no extra machinery is needed
to keep the write-pointers in sync for the meta-data zones. If this
prerequisite is not met, loading of the block-group is refused.
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:08 +0000 (15:40 -0500)]
btrfs: add support for multiple global roots
With extent tree v2 you will be able to create multiple csum, extent,
and free space trees. They will be used based on the block group, which
will now use the block_group_item->chunk_objectid to point to the set of
global roots that it will use. When allocating new block groups we'll
simply mod the gigabyte offset of the block group against the number of
global roots we have and that will be the block groups global id.
>From there we can take the bytenr that we're modifying in the respective
tree, look up the block group and get that block groups corresponding
global root id. From there we can get to the appropriate global root
for that bytenr.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:06 +0000 (15:40 -0500)]
btrfs: abstract out loading the tree root
We're going to be adding more roots that need to be loaded from the
super block, so abstract out the code to read the tree_root from the
super block, and use this helper for the chunk root as well. This will
make it simpler to load the new trees in the future.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:04 +0000 (15:40 -0500)]
btrfs: disable space cache related mount options for extent tree v2
We cannot fall back on the slow caching for extent tree v2, which means
we can't just arbitrarily clear the free space trees at mount time.
Furthermore we can't do v1 space cache with extent tree v2. Simply
ignore these mount options for extent tree v2 as they aren't relevant.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:02 +0000 (15:40 -0500)]
btrfs: disable scrub for extent-tree-v2
Scrub depends on extent references for every block, and with extent tree
v2 we won't have that, so disable scrub until we can add back the proper
code to handle extent-tree-v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:40:01 +0000 (15:40 -0500)]
btrfs: disable qgroups in extent tree v2
Backref lookups are going to be drastically different with extent tree
v2, disable qgroups until we do the work to add this support for extent
tree v2.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Josef Bacik [Wed, 15 Dec 2021 20:39:59 +0000 (15:39 -0500)]
btrfs: disable balance for extent tree v2 for now
With global root id's it makes it problematic to do backref lookups for
balance. This isn't hard to deal with, but future changes are going to
make it impossible to lookup backrefs on any COWonly roots, so go ahead
and disable balance for now on extent tree v2 until we can add balance
support back in future patches.
Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:11 +0000 (11:00 +0000)]
btrfs: use single variable to track return value at btrfs_log_inode()
At btrfs_log_inode(), we have two variables to track errors and the
return value of the function, named 'ret' and 'err'. In some places we
use 'ret' and if gets a non-zero value we assign its value to 'err'
and then jump to the 'out' label, while in other places we use 'err'
directly without 'ret' as an intermediary. This is inconsistent, error
prone and not necessary. So change that to use only the 'ret' variable,
making this consistent with most functions in btrfs.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:10 +0000 (11:00 +0000)]
btrfs: avoid inode logging during rename and link when possible
During a rename or link operation, we need to determine if an inode was
previously logged or not, and if it was, do some update to the logged
inode. We used to rely exclusively on the logged_trans field of struct
btrfs_inode to determine that, but that was not reliable because the
value of that field is not persisted in the inode item, so it's lost
when an inode is evicted and loaded back again. That led to several
issues in the past, such as not persisting deletions (such as the case
fixed by commit 803f0f64d17769 ("Btrfs: fix fsync not persisting dentry
deletions due to inode evictions")), or resulting in losing a file
after an inode eviction followed by a rename (commit ecc64fab7d49c6
("btrfs: fix lost inode on log replay after mix of fsync, rename and
inode eviction")), besides other issues.
So the inode_logged() helper was introduced and used to determine if an
inode was possibly logged before in the current transaction, with the
caveat that it could return false positives, in the sense that even if an
inode was not logged before in the current transaction, it could still
return true, but never to return false in case the inode was logged.
>From a functional point of view that is fine, but from a performance
perspective it can introduce significant latencies to rename and link
operations, as they will end up doing inode logging even when it is not
necessary.
Recently on a 5.15 kernel, an openSUSE Tumbleweed user reported package
installations and upgrades, with the zypper tool, were often taking a
long time to complete. With strace it could be observed that zypper was
spending about 99% of its time on rename operations, and then with
further analysis we checked that directory logging was happening too
frequently. Taking into account that installation/upgrade of some of the
packages needed a few thousand file renames, the slowdown was very
noticeable for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14 or
a 5.16-rc8 kernel. While triggering the inode evictions if something
outside btrfs' control, btrfs could still behave better by eliminating
the false positives from the inode_logged() helper.
So change inode_logged() to actually eliminate such false positives caused
by inode eviction and when an inode was never logged since the filesystem
was mounted, as both cases relate to when the logged_trans field of struct
btrfs_inode has a value of zero. When it can not determine if the inode
was logged based only on the logged_trans value, lookup for the existence
of the inode item in the log tree - if it's there then we known the inode
was logged, if it's not there then it can not have been logged in the
current transaction. Once we determine if the inode was logged, update
the logged_trans value to avoid future calls to have to search in the log
tree again.
Alternatively, we could start storing logged_trans in the on disk inode
item structure (struct btrfs_inode_item) in the unused space it still has,
but that would be a bit odd because:
1) We only care about logged_trans since the filesystem was mounted, we
don't care about its value from a previous mount. Having it persisted
in the inode item structure would not make the best use of the precious
unused space;
2) In order to get logged_trans persisted before inode eviction, we would
have to update the delayed inode when we finish logging the inode and
update its logged_trans in struct btrfs_inode, which makes it a bit
cumbersome since we need to check if the delayed inode exists, if not
create it and populate it and deal with any errors (-ENOMEM mostly).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
The following test script mimics part of what the zypper tool does during
package installations/upgrades. It does not triggers inode evictions, but
it's similar because it triggers false positives from the inode_logged()
helper, because the inodes have a logged_trans of 0, there's a log tree
due to a fsync of an unrelated file and the directory inode has its
last_trans field set to the current transaction:
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Now do some change to an unrelated file and fsync it.
# This is just to create a log tree to make sure that inode_logged()
# does not return false when called against "testdir".
xfs_io -f -c "pwrite 0 4K" -c "fsync" $MNT/foo
# Do some change to testdir. This is to make sure inode_logged()
# will return true when called against "testdir", because its
# logged_trans is 0, it was changed in the current transaction
# and there's a log tree.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
Testing this change on a box using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27837 ms
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9236 ms (-66.8%)
NUM_FILES=10000, after whole patchset applied: 8902 ms (-68.0%)
NUM_FILES=5000, before patchset: 9127 ms
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4640 ms (-49.2%)
NUM_FILES=5000, after whole patchset applied: 4441 ms (-51.3%)
NUM_FILES=2000, before patchset: 2528 ms
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1983 ms (-21.6%)
NUM_FILES=2000, after whole patchset applied: 1747 ms (-30.9%)
NUM_FILES=1000, before patchset: 1085 ms
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 893 ms (-17.7%)
NUM_FILES=1000, after whole patchset applied: 867 ms (-20.1%)
Running dbench on the same physical machine with the following script:
Filipe Manana [Thu, 20 Jan 2022 11:00:09 +0000 (11:00 +0000)]
btrfs: stop doing unnecessary log updates during a rename
During a rename, we call __btrfs_unlink_inode(), which will call
btrfs_del_inode_ref_in_log() and btrfs_del_dir_entries_in_log(), in order
to remove an inode reference and a directory entry from the log. These
are necessary when __btrfs_unlink_inode() is called from the unlink path,
but not necessary when it's called from a rename context, because:
1) For the btrfs_del_inode_ref_in_log() call, it's pointless to delete the
inode reference related to the old name, because later in the rename
path we call btrfs_log_new_name(), which will drop all inode references
from the log and copy all inode references from the subvolume tree to
the log tree. So we are doing one unnecessary btree operation which
adds additional latency and lock contention in case there are other
tasks accessing the log tree;
2) For the btrfs_del_dir_entries_in_log() call, we are now doing the
equivalent at btrfs_log_new_name() since the previous patch in the
series, that has the subject "btrfs: avoid logging all directory
changes during renames". In fact, having __btrfs_unlink_inode() call
this function not only adds additional latency and lock contention due
to the extra btree operation, but also can make btrfs_log_new_name()
unnecessarily log a range item to track the deletion of the old name,
since it has no way to known that the directory entry related to the
old name was previously logged and already deleted by
__btrfs_unlink_inode() through its call to
btrfs_del_dir_entries_in_log().
So skip those calls at __btrfs_unlink_inode() when we are doing a rename.
Skipping them also allows us now to reduce the duration of time we are
pinning a log transaction during renames, which is always beneficial as
it's not delaying so much other tasks trying to sync the log tree, in
particular we end up not holding the log transaction pinned while adding
the new name (adding inode ref, directory entry, etc).
This change is part of a patchset comprised of the following patches:
1/5 btrfs: add helper to delete a dir entry from a log tree
2/5 btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
3/5 btrfs: avoid logging all directory changes during renames
4/5 btrfs: stop doing unnecessary log updates during a rename
5/5 btrfs: avoid inode logging during rename and link when possible
Just like the previous patch in the series, "btrfs: avoid logging all
directory changes during renames", the following script mimics part of
what a package installation/upgrade with zypper does, which is basically
renaming a lot of files, in some directory under /usr, to a name with a
suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
Testing this change on box a using a non-debug kernel (Debian's default
kernel config) gave the following results:
NUM_FILES=10000, before patchset: 27399 ms
NUM_FILES=10000, after patches 1/5 to 3/5 applied: 9093 ms (-66.8%)
NUM_FILES=10000, after patches 1/5 to 4/5 applied: 9016 ms (-67.1%)
NUM_FILES=5000, before patchset: 9241 ms
NUM_FILES=5000, after patches 1/5 to 3/5 applied: 4642 ms (-49.8%)
NUM_FILES=5000, after patches 1/5 to 4/5 applied: 4553 ms (-50.7%)
NUM_FILES=2000, before patchset: 2550 ms
NUM_FILES=2000, after patches 1/5 to 3/5 applied: 1788 ms (-29.9%)
NUM_FILES=2000, after patches 1/5 to 4/5 applied: 1767 ms (-30.7%)
NUM_FILES=1000, before patchset: 1088 ms
NUM_FILES=1000, after patches 1/5 to 3/5 applied: 905 ms (-16.9%)
NUM_FILES=1000, after patches 1/5 to 4/5 applied: 883 ms (-18.8%)
The next patch in the series (5/5), also contains dbench results after
applying to whole patchset.
Filipe Manana [Thu, 20 Jan 2022 11:00:08 +0000 (11:00 +0000)]
btrfs: avoid logging all directory changes during renames
When doing a rename of a file, if the file or its old parent directory
were logged before, we log the new name of the file and then make sure
we log the old parent directory, to ensure that after a log replay the
old name of the file is deleted and the new name added.
The logging of the old parent directory can take some time, because it
will scan all leaves modified in the current transaction, check which
directory entries were already logged, copy the ones that were not
logged before, etc. In this rename context all we need to do is make
sure that the old name of the file is deleted on log replay, so instead
of triggering a directory log operation, we can just delete the old
directory entry from the log if it's there, or in case it isn't there,
just log a range item to signal log replay that the old name must be
deleted. So change btrfs_log_new_name() to do that.
This scenario is actually not uncommon to trigger, and recently on a
5.15 kernel, an openSUSE Tumbleweed user reported package installations
and upgrades, with the zypper tool, were often taking a long time to
complete, much more than usual. With strace it could be observed that
zypper was spending over 99% of its time on rename operations, and then
with further analysis we checked that directory logging was happening
too frequently and causing high latencies for the rename operations.
Taking into account that installation/upgrade of some of these packages
needed about a few thousand file renames, the slowdown was very noticeable
for the user.
The issue was caused indirectly due to an excessive number of inode
evictions on a 5.15 kernel, about 100x more compared to a 5.13, 5.14
or a 5.16-rc8 kernel. After an inode eviction we can't tell for sure,
in an efficient way, if an inode was previously logged in the current
transaction, so we are pessimistic and assume it was, because in case
it was we need to update the logged inode. More details on that in one
of the patches in the same series (subject "btrfs: avoid inode logging
during rename and link when possible"). Either way, in case the parent
directory was logged before, we currently do more work then necessary
during a rename, and this change minimizes that amount of work.
The following script mimics part of what a package installation/upgrade
with zypper does, which is basically renaming a lot of files, in some
directory under /usr, to a name with a suffix of "-RPMDELETE":
$ cat test.sh
#!/bin/bash
DEV=/dev/nvme0n1
MNT=/mnt/nvme0n1
NUM_FILES=10000
mkfs.btrfs -f $DEV
mount $DEV $MNT
mkdir $MNT/testdir
for ((i = 1; i <= $NUM_FILES; i++)); do
echo -n > $MNT/testdir/file_$i
done
sync
# Do some change to testdir and fsync it.
echo -n > $MNT/testdir/file_$((NUM_FILES + 1))
xfs_io -c "fsync" $MNT/testdir
echo "Renaming $NUM_FILES files..."
start=$(date +%s%N)
for ((i = 1; i <= $NUM_FILES; i++)); do
mv $MNT/testdir/file_$i $MNT/testdir/file_$i-RPMDELETE
done
end=$(date +%s%N)
Filipe Manana [Thu, 20 Jan 2022 11:00:07 +0000 (11:00 +0000)]
btrfs: pass the dentry to btrfs_log_new_name() instead of the inode
In the next patch in the series, there will be the need to access the old
name, and its length, of an inode when logging the inode during a rename.
So instead of passing the inode to btrfs_log_new_name() pass the dentry,
because from the dentry we can get the inode, the name and its length.
This will avoid passing 3 new parameters to btrfs_log_new_name() in the
next patch - the name, its length and an index number. This way we end
up passing only 1 new parameter, the index number.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Thu, 20 Jan 2022 11:00:06 +0000 (11:00 +0000)]
btrfs: add helper to delete a dir entry from a log tree
Move the code that finds and deletes a logged dir entry out of
btrfs_del_dir_entries_in_log() into a helper function. This new helper
function will be used by another patch in the same series, and serves
to avoid having duplicated logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Thu, 13 Jan 2022 15:16:18 +0000 (17:16 +0200)]
btrfs: move QUOTA_ENABLED check to rescan_should_stop from btrfs_qgroup_rescan_worker
Instead of having 2 places that short circuit the qgroup leaf scan have
everything in the qgroup_rescan_leaf function. In addition to that, also
ensure that the inconsistent qgroup flag is set when rescan_should_stop
returns true. This both retains the old behavior when -EINTR was set in
the body of the loop and at the same time also extends this behavior
when scanning is interrupted due to remount or unmount operations.
Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:02 +0000 (13:06 +0800)]
btrfs: use dev_t to match device in device_matched
Commit "btrfs: add device major-minor info in the struct btrfs_device"
saved the device major-minor number in the struct btrfs_device upon
discovering it.
So no need to lookup_bdev() again just match, which means
device_matched() can go away.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:01 +0000 (13:06 +0800)]
btrfs: add device major-minor info in the struct btrfs_device
Internally it is common to use the major-minor number to identify a
device and, at a few locations in btrfs, we use the major-minor number
to match the device.
So when we identify a new btrfs device through device add or device
replace or device-scan/ready save the device's major-minor (dev_t) in the
struct btrfs_device so that we don't have to call lookup_bdev() again.
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:06:00 +0000 (13:06 +0800)]
btrfs: match stale devices by dev_t
After the commit "btrfs: harden identification of the stale device", we
don't have to match the device path anymore. Instead, we match the dev_t.
So pass in the dev_t instead of the device path, in the call chain
btrfs_forget_devices()->btrfs_free_stale_devices().
Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Anand Jain [Wed, 12 Jan 2022 05:05:59 +0000 (13:05 +0800)]
btrfs: harden identification of a stale device
Identifying and removing the stale device from the fs_uuids list is done
by btrfs_free_stale_devices(). btrfs_free_stale_devices() in turn
depends on device_path_matched() to check if the device appears in more
than one btrfs_device structure.
The matching of the device happens by its path, the device path. However,
when device mapper is in use, the dm device paths are nothing but a link
to the actual block device, which leads to the device_path_matched()
failing to match.
Fix this by matching the dev_t as provided by lookup_bdev() instead of
plain string compare of the device paths.
Reported-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
Nikolay Borisov [Tue, 11 Jan 2022 16:00:26 +0000 (18:00 +0200)]
btrfs: move missing device handling in a dedicate function
This simplifies the code flow in read_one_chunk and makes error handling
when handling missing devices a bit simpler by reducing it to a single
check if something went wrong. No functional changes.
Reviewed-by: Su Yue <l@damenly.su> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:20:01 +0000 (12:20 +0000)]
btrfs: stop trying to log subdirectories created in past transactions
When logging a directory we are trying to log subdirectories that were
changed in the current transaction and created in a past transaction.
This type of behaviour was introduced by commit 2f2ff0ee5e4303 ("Btrfs:
fix metadata inconsistencies after directory fsync"), to fix some metadata
inconsistencies that in the meanwhile no longer need this behaviour due to
numerous other changes that happened throughout the years.
This behaviour, besides not needed anymore, it's also undesirable because:
1) It's not reliable because it's only triggered for the directories
of dentries (dir items) that happen to be present on a leaf that
was changed in the current transaction. If a dentry that points to
a directory resides on a leaf that was not changed in the current
transaction, then it's not logged, as at log_dir_items() and
log_new_dir_dentries() we use btrfs_search_forward();
2) It's not required by posix or any standard, it's undefined territory.
The only way to guarantee a subdirectory is logged, it to explicitly
fsync it;
Making the behaviour guaranteed would require scanning all directory
items, check which point to a directory, and then fsync each subdirectory
which was modified in the current transaction. This could be very
expensive for large directories with many subdirectories and/or large
subdirectories.
So remove that obsolete logic.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:20:00 +0000 (12:20 +0000)]
btrfs: stop copying old dir items when logging a directory
When logging a directory, we go over every leaf of the subvolume tree that
was changed in the current transaction and copy all its dir index keys to
the log tree.
That includes copying dir index keys created in past transactions. This is
done mostly for simplicity, as after logging the keys we log an item that
specifies the start and end ranges of the keys we logged. That item is
then used during log replay to figure out which keys need to be deleted -
every key in that range that we find in the subvolume tree and is not in
the log tree, needs to be deleted.
Now that we log only dir index keys, and not dir item keys anymore, when
we remove dentries from a directory (due to unlink and rename operations),
we can get entire leaves that we changed only for deleting old dir index
keys, or that have few dir index keys that are new - this is due to the
fact that the offset for new index keys comes from a monotonically
increasing counter.
We can avoid logging dir index keys from past transactions, and in order
to track the deletions, only log range items (BTRFS_DIR_LOG_INDEX_KEY key
type) when we find gaps between consecutive index keys. This massively
reduces the amount of logged metadata when we have deleted directory
entries, even if it's a small percentage of the total number of entries.
The reduction comes from both less items that are logged and instead of
logging many dir index items (struct btrfs_dir_item), which have a size
of 30 bytes plus a file name, we typically log just a few range items
(struct btrfs_dir_log_item), which take only 8 bytes each.
Even if no entries were deleted from a directory and only new entries
were added, we typically still get a reduction on the amount of logged
metadata, because it's very likely the first leaf that got the new
dir index entries also has several old dir index entries.
So change the logging logic to not log dir index keys created in past
transactions and log a range item for every gap it finds between each
pair of consecutive index keys, to ensure deletions are tracked and
replayed on log replay.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
The following test was run on a branch without this patchset and on a
branch with the first three patches applied:
dur=$(( (end - start) / 1000000 ))
echo "dir fsync took $dur ms after deleting $NUM_FILE_DELETES files"
echo
umount $MNT
The test was run on a non-debug kernel (Debian's default kernel config),
and the results were the following for various values of NUM_FILES and
NUM_FILE_DELETES:
Filipe Manana [Wed, 15 Dec 2021 12:19:59 +0000 (12:19 +0000)]
btrfs: put initial index value of a directory in a constant
At btrfs_set_inode_index_count() we refer twice to the number 2 as the
initial index value for a directory (when it's empty), with a proper
comment explaining the reason for that value. In the next patch I'll
have to use that magic value in the directory logging code, so put
the value in a #define at btrfs_inode.h, to avoid hardcoding the
magic value again at tree-log.c.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Wed, 15 Dec 2021 12:19:58 +0000 (12:19 +0000)]
btrfs: don't log unnecessary boundary keys when logging directory
Before we start to log dir index keys from a leaf, we check if there is a
previous index key, which normally is at the end of a leaf that was not
changed in the current transaction. Then we log that key and set the start
of logged range (item of type BTRFS_DIR_LOG_INDEX_KEY) to the offset of
that key. This is to ensure that if there were deleted index keys between
that key and the first key we are going to log, those deletions are
replayed in case we need to replay to the log after a power failure.
However we really don't need to log that previous key, we can just set the
start of the logged range to that key's offset plus 1. This achieves the
same and avoids logging one dir index key.
The same logic is performed when we finish logging the index keys of a
leaf and we find that the next leaf has index keys and was not changed in
the current transaction. We are logging the first key of that next leaf
and use its offset as the end of range we log. This is just to ensure that
if there were deleted index keys between the last index key we logged and
the first key of that next leaf, those index keys are deleted if we end
up replaying the log. However that is not necessary, we can avoid logging
that first index key of the next leaf and instead set the end of the
logged range to match the offset of that index key minus 1.
So avoid logging those index keys at the boundaries and adjust the start
and end offsets of the logged ranges as described above.
This patch is part of a patchset comprised of the following patches:
1/4 btrfs: don't log unnecessary boundary keys when logging directory
2/4 btrfs: put initial index value of a directory in a constant
3/4 btrfs: stop copying old dir items when logging a directory
4/4 btrfs: stop trying to log subdirectories created in past transactions
Performance test results are listed in the changelog of patch 3/4.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Sahil Kang [Wed, 5 Jan 2022 08:30:06 +0000 (00:30 -0800)]
btrfs: reuse existing pointers from btrfs_ioctl
btrfs_ioctl already contains pointers to the inode and btrfs_root
structs, so we can pass them into the subfunctions instead of the
toplevel struct file.
Signed-off-by: Sahil Kang <sahil.kang@asilaycomputing.com> Signed-off-by: David Sterba <dsterba@suse.com>
Filipe Manana [Tue, 4 Jan 2022 12:53:41 +0000 (12:53 +0000)]
btrfs: remove write and wait of struct walk_control
The ->write and ->wait fields of struct walk_control, used for log trees,
are not used since 2008, more specifically since commit d0c803c4049c5c
("Btrfs: Record dirty pages tree-log pages in an extent_io tree") and
since commit d0c803c4049c5c ("Btrfs: Record dirty pages tree-log pages in
an extent_io tree"). So just remove them, along with the function
btrfs_write_tree_block(), which is also not used anymore after removing
the ->write member.
Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
Linus Torvalds [Sun, 13 Mar 2022 17:36:38 +0000 (10:36 -0700)]
Merge tag 'x86_urgent_for_v5.17_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 fixes from Borislav Petkov:
- Free shmem backing storage for SGX enclave pages when those are
swapped back into EPC memory
- Prevent do_int3() from being kprobed, to avoid recursion
- Remap setup_data and setup_indirect structures properly when
accessing their members
- Correct the alternatives patching order for modules too
* tag 'x86_urgent_for_v5.17_rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/sgx: Free backing memory after faulting the enclave page
x86/traps: Mark do_int3() NOKPROBE_SYMBOL
x86/boot: Add setup_indirect support in early_memremap_is_setup_data()
x86/boot: Fix memremap of setup_indirect structures
x86/module: Fix the paravirt vs alternative order
Linus Torvalds [Sat, 12 Mar 2022 18:29:25 +0000 (10:29 -0800)]
Merge tag 'perf-tools-fixes-for-v5.17-2022-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools fixes from Arnaldo Carvalho de Melo:
- Fix event parser error for hybrid systems
- Fix NULL check against wrong variable in 'perf bench' and in the
parsing code
- Update arm64 KVM headers from the kernel sources
- Sync cpufeatures header with the kernel sources
* tag 'perf-tools-fixes-for-v5.17-2022-03-12' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux:
perf parse: Fix event parser error for hybrid systems
perf bench: Fix NULL check against wrong variable
perf parse-events: Fix NULL check against wrong variable
tools headers cpufeatures: Sync with the kernel sources
tools kvm headers arm64: Update KVM headers from the kernel sources
Linus Torvalds [Sat, 12 Mar 2022 18:22:43 +0000 (10:22 -0800)]
Merge tag 'drm-fixes-2022-03-12' of git://anongit.freedesktop.org/drm/drm
Pull drm kconfig fix from Dave Airlie:
"Thorsten pointed out this had fallen down the cracks and was in -next
only, I've picked it out, fixed up it's Fixes: line.
- fix regression in Kconfig"
* tag 'drm-fixes-2022-03-12' of git://anongit.freedesktop.org/drm/drm:
drm/panel: Select DRM_DP_HELPER for DRM_PANEL_EDP
Zhengjun Xing [Mon, 7 Mar 2022 15:16:27 +0000 (23:16 +0800)]
perf parse: Fix event parser error for hybrid systems
This bug happened on hybrid systems when both cpu_core and cpu_atom
have the same event name such as "UOPS_RETIRED.MS" while their event
terms are different, then during perf stat, the event for cpu_atom
will parse fail and then no output for cpu_atom.
It is because event terms in the "head" of parse_events_multi_pmu_add
will be changed to event terms for cpu_core after parsing UOPS_RETIRED.MS
for cpu_core, then when parsing the same event for cpu_atom, it still
uses the event terms for cpu_core, but event terms for cpu_atom are
different with cpu_core, the event parses for cpu_atom will fail. This
patch fixes it, the event terms should be parsed from the original
event.
This patch can work for the hybrid systems that have the same event
in more than 2 PMUs. It also can work in non-hybrid systems.
Before:
# perf stat -v -e UOPS_RETIRED.MS -a sleep 1
Using CPUID GenuineIntel-6-97-1
UOPS_RETIRED.MS -> cpu_core/period=0x1e8483,umask=0x4,event=0xc2,frontend=0x8/
Control descriptor is not initialized
UOPS_RETIRED.MS: 27378451606851848516068518485
Using CPUID GenuineIntel-6-97-1
UOPS_RETIRED.MS -> cpu_core/period=0x1e8483,umask=0x4,event=0xc2,frontend=0x8/
UOPS_RETIRED.MS -> cpu_atom/period=0x1e8483,umask=0x1,event=0xc2/
Control descriptor is not initialized
UOPS_RETIRED.MS: 19775551607695071116076950711
UOPS_RETIRED.MS: 568684 80386942348038694234
Weiguo Li [Fri, 11 Mar 2022 13:06:57 +0000 (21:06 +0800)]
perf parse-events: Fix NULL check against wrong variable
We did a null check after "tmp->symbol = strdup(...)", but we checked
"list->symbol" other than "tmp->symbol".
Reviewed-by: John Garry <john.garry@huawei.com> Signed-off-by: Weiguo Li <liwg06@foxmail.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiri Olsa <jolsa@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: https://lore.kernel.org/r/tencent_DF39269807EC9425E24787E6DB632441A405@qq.com Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
tools headers cpufeatures: Sync with the kernel sources
To pick the changes from:
d45476d983240937 ("x86/speculation: Rename RETPOLINE_AMD to RETPOLINE_LFENCE")
Its just a comment fixup.
This only causes these perf files to be rebuilt:
CC /tmp/build/perf/bench/mem-memcpy-x86-64-asm.o
CC /tmp/build/perf/bench/mem-memset-x86-64-asm.o
And addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/x86/include/asm/cpufeatures.h' differs from latest version at 'arch/x86/include/asm/cpufeatures.h'
diff -u tools/arch/x86/include/asm/cpufeatures.h arch/x86/include/asm/cpufeatures.h
tools kvm headers arm64: Update KVM headers from the kernel sources
To pick the changes from:
a5905d6af492ee6a ("KVM: arm64: Allow SMCCC_ARCH_WORKAROUND_3 to be discovered and migrated")
That don't causes any changes in tooling (when built on x86), only
addresses this perf build warning:
Warning: Kernel ABI header at 'tools/arch/arm64/include/uapi/asm/kvm.h' differs from latest version at 'arch/arm64/include/uapi/asm/kvm.h'
diff -u tools/arch/arm64/include/uapi/asm/kvm.h arch/arm64/include/uapi/asm/kvm.h
Randy Dunlap [Fri, 11 Mar 2022 19:49:12 +0000 (11:49 -0800)]
ARM: Spectre-BHB: provide empty stub for non-config
When CONFIG_GENERIC_CPU_VULNERABILITIES is not set, references
to spectre_v2_update_state() cause a build error, so provide an
empty stub for that function when the Kconfig option is not set.
Fixes this build error:
arm-linux-gnueabi-ld: arch/arm/mm/proc-v7-bugs.o: in function `cpu_v7_bugs_init':
proc-v7-bugs.c:(.text+0x52): undefined reference to `spectre_v2_update_state'
arm-linux-gnueabi-ld: proc-v7-bugs.c:(.text+0x82): undefined reference to `spectre_v2_update_state'
Fixes: b9baf5c8c5c3 ("ARM: Spectre-BHB workaround") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reported-by: kernel test robot <lkp@intel.com> Cc: Russell King <rmk+kernel@armlinux.org.uk> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: patches@armlinux.org.uk Acked-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2022 20:28:21 +0000 (12:28 -0800)]
Merge tag 'riscv-for-linus-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux
Pull RISC-V fixes from Palmer Dabbelt:
- prevent users from enabling the alternatives framework (and thus
errata handling) on XIP kernels, where runtime code patching does not
function correctly.
- properly detect offset overflow for AUIPC-based relocations in
modules. This may manifest as modules calling arbitrary invalid
addresses, depending on the address allocated when a module is
loaded.
* tag 'riscv-for-linus-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
riscv: Fix auipc+jalr relocation range checks
riscv: alternative only works on !XIP_KERNEL
When building for Thumb2, the vectors make use of a local label. Sadly,
the Spectre BHB code also uses a local label with the same number which
results in the Thumb2 reference pointing at the wrong place. Fix this
by changing the number used for the Spectre BHB local label.
Fixes: b9baf5c8c5c3 ("ARM: Spectre-BHB workaround") Tested-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2022 19:24:58 +0000 (11:24 -0800)]
Merge tag 'mmc-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc
Pull MMC fixes from Ulf Hansson:
"MMC core:
- Restore (mostly) the busy polling for MMC_SEND_OP_COND
MMC host:
- meson-gx: Fix DMA usage of meson_mmc_post_req()"
* tag 'mmc-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/ulfh/mmc:
mmc: core: Restore (almost) the busy polling for MMC_SEND_OP_COND
mmc: meson: Fix usage of meson_mmc_post_req()
Jarkko Sakkinen [Thu, 3 Mar 2022 22:38:58 +0000 (00:38 +0200)]
x86/sgx: Free backing memory after faulting the enclave page
There is a limited amount of SGX memory (EPC) on each system. When that
memory is used up, SGX has its own swapping mechanism which is similar
in concept but totally separate from the core mm/* code. Instead of
swapping to disk, SGX swaps from EPC to normal RAM. That normal RAM
comes from a shared memory pseudo-file and can itself be swapped by the
core mm code. There is a hierarchy like this:
EPC <-> shmem <-> disk
After data is swapped back in from shmem to EPC, the shmem backing
storage needs to be freed. Currently, the backing shmem is not freed.
This effectively wastes the shmem while the enclave is running. The
memory is recovered when the enclave is destroyed and the backing
storage freed.
Sort this out by freeing memory with shmem_truncate_range(), as soon as
a page is faulted back to the EPC. In addition, free the memory for
PCMD pages as soon as all PCMD's in a page have been marked as unused
by zeroing its contents.
Cc: stable@vger.kernel.org Fixes: 1728ab54b4be ("x86/sgx: Add a page reclaimer") Reported-by: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lkml.kernel.org/r/20220303223859.273187-1-jarkko@kernel.org
Linus Torvalds [Fri, 11 Mar 2022 18:28:32 +0000 (10:28 -0800)]
Merge branch 'davidh' (fixes from David Howells)
Merge misc fixes from David Howells:
"A set of patches for watch_queue filter issues noted by Jann. I've
added in a cleanup patch from Christophe Jaillet to convert to using
formal bitmap specifiers for the note allocation bitmap.
Also two filesystem fixes (afs and cachefiles)"
* emailed patches from David Howells <dhowells@redhat.com>:
cachefiles: Fix volume coherency attribute
afs: Fix potential thrashing in afs writeback
watch_queue: Make comment about setting ->defunct more accurate
watch_queue: Fix lack of barrier/sync/lock between post and read
watch_queue: Free the alloc bitmap when the watch_queue is torn down
watch_queue: Fix the alloc bitmap size to reflect notes allocated
watch_queue: Use the bitmap API when applicable
watch_queue: Fix to always request a pow-of-2 pipe ring size
watch_queue: Fix to release page in ->release()
watch_queue, pipe: Free watchqueue state after clearing pipe ring
watch_queue: Fix filter limit check
David Howells [Fri, 11 Mar 2022 16:02:18 +0000 (16:02 +0000)]
cachefiles: Fix volume coherency attribute
A network filesystem may set coherency data on a volume cookie, and if
given, cachefiles will store this in an xattr on the directory in the
cache corresponding to the volume.
The function that sets the xattr just stores the contents of the volume
coherency buffer directly into the xattr, with nothing added; the
checking function, on the other hand, has a cut'n'paste error whereby it
tries to interpret the xattr contents as would be the xattr on an
ordinary file (using the cachefiles_xattr struct). This results in a
failure to match the coherency data because the buffer ends up being
shifted by 18 bytes.
Fix this by defining a structure specifically for the volume xattr and
making both the setting and checking functions use it.
Since the volume coherency doesn't work if used, take the opportunity to
insert a reserved field for future use, set it to 0 and check that it is
0. Log mismatch through the appropriate tracepoint.
Note that this only affects cifs; 9p, afs, ceph and nfs don't use the
volume coherency data at the moment.
Fixes: 32e150037dce ("fscache, cachefiles: Store the volume coherency data") Reported-by: Rohith Surabattula <rohiths.msft@gmail.com> Signed-off-by: David Howells <dhowells@redhat.com> Reviewed-by: Jeff Layton <jlayton@kernel.org>
cc: Steve French <smfrench@gmail.com>
cc: linux-cifs@vger.kernel.org
cc: linux-cachefs@redhat.com Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 15:58:21 +0000 (15:58 +0000)]
afs: Fix potential thrashing in afs writeback
In afs_writepages_region(), if the dirty page we find is undergoing
writeback or write to cache, but the sync_mode is WB_SYNC_NONE, we go
round the loop trying the same page again and again with no pausing or
waiting unless and until another thread manages to clear the writeback
and fscache flags.
Fix this with three measures:
(1) Advance start to after the page we found.
(2) Break out of the loop and return if rescheduling is requested.
(3) Arbitrarily give up after a maximum of 5 skips.
Fixes: 31143d5d515e ("AFS: implement basic file write support") Reported-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Tested-by: Marc Dionne <marc.dionne@auristor.com> Acked-by: Marc Dionne <marc.dionne@auristor.com> Link: https://lore.kernel.org/r/164692725757.2097000.2060513769492301854.stgit@warthog.procyon.org.uk/ Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Li Huafei [Thu, 10 Mar 2022 12:09:15 +0000 (20:09 +0800)]
x86/traps: Mark do_int3() NOKPROBE_SYMBOL
Since kprobe_int3_handler() is called in do_int3(), probing do_int3()
can cause a breakpoint recursion and crash the kernel. Therefore,
do_int3() should be marked as NOKPROBE_SYMBOL.
David Howells [Fri, 11 Mar 2022 13:24:47 +0000 (13:24 +0000)]
watch_queue: Make comment about setting ->defunct more accurate
watch_queue_clear() has a comment stating that setting ->defunct to true
preventing new additions as well as preventing notifications. Whilst
the latter is true, the first bit is superfluous since at the time this
function is called, the pipe cannot be accessed to add new event
sources.
Remove the "new additions" bit from the comment.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:24:36 +0000 (13:24 +0000)]
watch_queue: Fix lack of barrier/sync/lock between post and read
There's nothing to synchronise post_one_notification() versus
pipe_read(). Whilst posting is done under pipe->rd_wait.lock, the
reader only takes pipe->mutex which cannot bar notification posting as
that may need to be made from contexts that cannot sleep.
Fix this by setting pipe->head with a barrier in post_one_notification()
and reading pipe->head with a barrier in pipe_read().
If that's not sufficient, the rd_wait.lock will need to be taken,
possibly in a ->confirm() op so that it only applies to notifications.
The lock would, however, have to be dropped before copy_page_to_iter()
is invoked.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:24:22 +0000 (13:24 +0000)]
watch_queue: Fix the alloc bitmap size to reflect notes allocated
Currently, watch_queue_set_size() sets the number of notes available in
wqueue->nr_notes according to the number of notes allocated, but sets
the size of the bitmap to the unrounded number of notes originally asked
for.
Fix this by setting the bitmap size to the number of notes we're
actually going to make available (ie. the number allocated).
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:24:08 +0000 (13:24 +0000)]
watch_queue: Fix to always request a pow-of-2 pipe ring size
The pipe ring size must always be a power of 2 as the head and tail
pointers are masked off by AND'ing with the size of the ring - 1.
watch_queue_set_size(), however, lets you specify any number of notes
between 1 and 511. This number is passed through to pipe_resize_ring()
without checking/forcing its alignment.
Fix this by rounding the number of slots required up to the nearest
power of two. The request is meant to guarantee that at least that many
notifications can be generated before the queue is full, so rounding
down isn't an option, but, alternatively, it may be better to give an
error if we aren't allowed to allocate that much ring space.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:23:46 +0000 (13:23 +0000)]
watch_queue: Fix to release page in ->release()
When a pipe ring descriptor points to a notification message, the
refcount on the backing page is incremented by the generic get function,
but the release function, which marks the bitmap, doesn't drop the page
ref.
Fix this by calling generic_pipe_buf_release() at the end of
watch_queue_pipe_buf_release().
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:23:38 +0000 (13:23 +0000)]
watch_queue, pipe: Free watchqueue state after clearing pipe ring
In free_pipe_info(), free the watchqueue state after clearing the pipe
ring as each pipe ring descriptor has a release function, and in the
case of a notification message, this is watch_queue_pipe_buf_release()
which tries to mark the allocation bitmap that was previously released.
Fix this by moving the put of the pipe's ref on the watch queue to after
the ring has been cleared. We still need to call watch_queue_clear()
before doing that to make sure that the pipe is disconnected from any
notification sources first.
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
David Howells [Fri, 11 Mar 2022 13:23:31 +0000 (13:23 +0000)]
watch_queue: Fix filter limit check
In watch_queue_set_filter(), there are a couple of places where we check
that the filter type value does not exceed what the type_filter bitmap
can hold. One place calculates the number of bits by:
if (tf[i].type >= sizeof(wfilter->type_filter) * 8)
which is fine, but the second does:
if (tf[i].type >= sizeof(wfilter->type_filter) * BITS_PER_LONG)
which is not. This can lead to a couple of out-of-bounds writes due to
a too-large type:
(1) __set_bit() on wfilter->type_filter
(2) Writing more elements in wfilter->filters[] than we allocated.
Fix this by just using the proper WATCH_TYPE__NR instead, which is the
number of types we actually know about.
The bug may cause an oops looking something like:
BUG: KASAN: slab-out-of-bounds in watch_queue_set_filter+0x659/0x740
Write of size 4 at addr ffff88800d2c66bc by task watch_queue_oob/611
...
Call Trace:
<TASK>
dump_stack_lvl+0x45/0x59
print_address_description.constprop.0+0x1f/0x150
...
kasan_report.cold+0x7f/0x11b
...
watch_queue_set_filter+0x659/0x740
...
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
Allocated by task 611:
kasan_save_stack+0x1e/0x40
__kasan_kmalloc+0x81/0xa0
watch_queue_set_filter+0x23a/0x740
__x64_sys_ioctl+0x127/0x190
do_syscall_64+0x43/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae
The buggy address belongs to the object at ffff88800d2c66a0
which belongs to the cache kmalloc-32 of size 32
The buggy address is located 28 bytes inside of
32-byte region [ffff88800d2c66a0, ffff88800d2c66c0)
Fixes: c73be61cede5 ("pipe: Add general notification queue support") Reported-by: Jann Horn <jannh@google.com> Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Fri, 11 Mar 2022 05:15:42 +0000 (21:15 -0800)]
Merge tag 'drm-fixes-2022-03-11' of git://anongit.freedesktop.org/drm/drm
Pull drm fixes from Dave Airlie:
"As expected at this stage its pretty quiet, one sun4i mixer fix and
one i915 display flicker fix:
i915:
- fix psr screen flicker
sun4i:
- mixer format fix"
* tag 'drm-fixes-2022-03-11' of git://anongit.freedesktop.org/drm/drm:
drm/sun4i: mixer: Fix P010 and P210 format numbers
drm/i915/psr: Set "SF Partial Frame Enable" also on full update
RISC-V can do PC-relative jumps with a 32bit range using the following
two instructions:
auipc t0, imm20 ; t0 = PC + imm20 * 2^12
jalr ra, t0, imm12 ; ra = PC + 4, PC = t0 + imm12
Crucially both the 20bit immediate imm20 and the 12bit immediate imm12
are treated as two's-complement signed values. For this reason the
immediates are usually calculated like this:
..where offset is the signed offset from the auipc instruction. When
the 11th bit of offset is 0 the addition of 0x800 doesn't change the top
20 bits and imm12 considered positive. When the 11th bit is 1 the carry
of the addition by 0x800 means imm20 is one higher, but since imm12 is
then considered negative the two's complement representation means it
all cancels out nicely.
However, this addition by 0x800 (2^11) means an offset greater than or
equal to 2^31 - 2^11 would overflow so imm20 is considered negative and
result in a backwards jump. Similarly the lower range of offset is also
moved down by 2^11 and hence the true 32bit range is
Linus Torvalds [Fri, 11 Mar 2022 01:23:08 +0000 (17:23 -0800)]
Merge tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull tracing fixes from Steven Rostedt:
"Minor tracing fixes:
- Fix unregistering the same event twice. A user could disable the
same event that osnoise will disable on unregistering.
- Inform RCU of a quiescent state in the osnoise testing thread.
- Fix some kerneldoc comments"
* tag 'trace-v5.17-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
ftrace: Fix some W=1 warnings in kernel doc comments
tracing/osnoise: Force quiescent states while tracing
tracing/osnoise: Do not unregister events twice
Linus Torvalds [Fri, 11 Mar 2022 00:47:58 +0000 (16:47 -0800)]
Merge tag 'net-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bluetooth, and ipsec.
Current release - regressions:
- Bluetooth: fix unbalanced unlock in set_device_flags()
- Bluetooth: fix not processing all entries on cmd_sync_work, make
connect with qualcomm and intel adapters reliable
- Revert "xfrm: state and policy should fail if XFRMA_IF_ID 0"
- xdp: xdp_mem_allocator can be NULL in trace_mem_connect()
- eth: ice: fix race condition and deadlock during interface enslave
Current release - new code bugs:
- tipc: fix incorrect order of state message data sanity check
Previous releases - regressions:
- esp: fix possible buffer overflow in ESP transformation
- dsa: unlock the rtnl_mutex when dsa_master_setup() fails
- phy: meson-gxl: fix interrupt handling in forced mode
- smsc95xx: ignore -ENODEV errors when device is unplugged
Previous releases - always broken:
- xfrm: fix tunnel mode fragmentation behavior
- esp: fix inter address family tunneling on GSO
- tipc: fix null-deref due to race when enabling bearer
- sctp: fix kernel-infoleak for SCTP sockets
- eth: macb: fix lost RX packet wakeup race in NAPI receive
- eth: intel stop disabling VFs due to PF error responses
- eth: bcmgenet: don't claim WOL when its not available"
* tag 'net-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (50 commits)
xdp: xdp_mem_allocator can be NULL in trace_mem_connect().
ice: Fix race condition during interface enslave
net: phy: meson-gxl: improve link-up behavior
net: bcmgenet: Don't claim WOL when its not available
net: arc_emac: Fix use after free in arc_mdio_probe()
sctp: fix kernel-infoleak for SCTP sockets
net: phy: correct spelling error of media in documentation
net: phy: DP83822: clear MISR2 register to disable interrupts
gianfar: ethtool: Fix refcount leak in gfar_get_ts_info
selftests: pmtu.sh: Kill nettest processes launched in subshell.
selftests: pmtu.sh: Kill tcpdump processes launched by subshell.
NFC: port100: fix use-after-free in port100_send_complete
net/mlx5e: SHAMPO, reduce TIR indication
net/mlx5e: Lag, Only handle events from highest priority multipath entry
net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE
net/mlx5: Fix a race on command flush flow
net/mlx5: Fix size field in bufferx_reg struct
ax25: Fix NULL pointer dereference in ax25_kill_by_device
net: marvell: prestera: Add missing of_node_put() in prestera_switch_set_base_mac_addr
net: ethernet: lpc_eth: Handle error for clk_enable
...
xdp: xdp_mem_allocator can be NULL in trace_mem_connect().
Since the commit mentioned below __xdp_reg_mem_model() can return a NULL
pointer. This pointer is dereferenced in trace_mem_connect() which leads
to segfault.
The trace points (mem_connect + mem_disconnect) were put in place to
pair connect/disconnect using the IDs. The ID is only assigned if
__xdp_reg_mem_model() does not return NULL. That connect trace point is
of no use if there is no ID.
Skip that connect trace point if xdp_alloc is NULL.
[ Toke Høiland-Jørgensen delivered the reasoning for skipping the trace
point ]
Fixes: 4a48ef70b93b8 ("xdp: Allow registering memory model without rxq reference") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/r/YikmmXsffE+QajTB@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Ivan Vecera [Thu, 10 Mar 2022 17:16:41 +0000 (18:16 +0100)]
ice: Fix race condition during interface enslave
Commit 5dbbbd01cbba83 ("ice: Avoid RTNL lock when re-creating
auxiliary device") changes a process of re-creation of aux device
so ice_plug_aux_dev() is called from ice_service_task() context.
This unfortunately opens a race window that can result in dead-lock
when interface has left LAG and immediately enters LAG again.
Reproducer:
```
#!/bin/sh
ip link add lag0 type bond mode 1 miimon 100
ip link set lag0
for n in {1..10}; do
echo Cycle: $n
ip link set ens7f0 master lag0
sleep 1
ip link set ens7f0 nomaster
done
```
1. Command 'ip link ... set nomaster' causes that ice_plug_aux_dev()
is called from ice_service_task() context, aux device is created
and associated device->lock is taken.
2. Command 'ip link ... set master...' calls ice's notifier under
RTNL lock and that notifier calls ice_unplug_aux_dev(). That
function tries to take aux device->lock but this is already taken
by ice_plug_aux_dev() in step 1
3. Later ice_plug_aux_dev() tries to take RTNL lock but this is already
taken in step 2
4. Dead-lock
The patch fixes this issue by following changes:
- Bit ICE_FLAG_PLUG_AUX_DEV is kept to be set during ice_plug_aux_dev()
call in ice_service_task()
- The bit is checked in ice_clear_rdma_cap() and only if it is not set
then ice_unplug_aux_dev() is called. If it is set (in other words
plugging of aux device was requested and ice_plug_aux_dev() is
potentially running) then the function only clears the bit
- Once ice_plug_aux_dev() call (in ice_service_task) is finished
the bit ICE_FLAG_PLUG_AUX_DEV is cleared but it is also checked
whether it was already cleared by ice_clear_rdma_cap(). If so then
aux device is unplugged.
Signed-off-by: Ivan Vecera <ivecera@redhat.com> Co-developed-by: Petr Oros <poros@redhat.com> Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Dave Ertman <david.m.ertman@intel.com> Link: https://lore.kernel.org/r/20220310171641.3863659-1-ivecera@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jeremy Linton [Thu, 10 Mar 2022 04:55:35 +0000 (22:55 -0600)]
net: bcmgenet: Don't claim WOL when its not available
Some of the bcmgenet platforms don't correctly support WOL, yet
ethtool returns:
"Supports Wake-on: gsf"
which is false.
Ideally if there isn't a wol_irq, or there is something else that
keeps the device from being able to wakeup it should display:
"Supports Wake-on: d"
This patch checks whether the device can wakup, before using the
hard-coded supported flags. This corrects the ethtool reporting, as
well as the WOL configuration because ethtool verifies that the mode
is supported before attempting it.
Fixes: c51de7f3976b ("net: bcmgenet: add Wake-on-LAN support code") Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Tested-by: Peter Robinson <pbrobinson@gmail.com> Acked-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220310045535.224450-1-jeremy.linton@arm.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Jianglei Nie [Wed, 9 Mar 2022 12:18:24 +0000 (20:18 +0800)]
net: arc_emac: Fix use after free in arc_mdio_probe()
If bus->state is equal to MDIOBUS_ALLOCATED, mdiobus_free(bus) will free
the "bus". But bus->name is still used in the next line, which will lead
to a use after free.
We can fix it by putting the name in a local variable and make the
bus->name point to the rodata section "name",then use the name in the
error message without referring to bus to avoid the uaf.
Jakub Kicinski [Thu, 10 Mar 2022 22:32:32 +0000 (14:32 -0800)]
Merge tag 'mlx5-fixes-2022-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5 fixes 2022-03-09
This series provides bug fixes to mlx5 driver.
* tag 'mlx5-fixes-2022-03-09' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux:
net/mlx5e: SHAMPO, reduce TIR indication
net/mlx5e: Lag, Only handle events from highest priority multipath entry
net/mlx5: Fix offloading with ESWITCH_IPV4_TTL_MODIFY_ENABLE
net/mlx5: Fix a race on command flush flow
net/mlx5: Fix size field in bufferx_reg struct
====================
Linus Torvalds [Thu, 10 Mar 2022 20:43:06 +0000 (12:43 -0800)]
Merge tag 'staging-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
Pull staging driver fixes from Greg KH:
"Here are three small fixes for staging drivers for 5.17-rc8 or -final,
which ever comes next.
They resolve some reported problems:
- rtl8723bs wifi driver deadlock fix for reported problem that is a
revert of a previous patch. Also a documentation fix is added so
that the same problem hopefully can not come back again.
- gdm724x driver use-after-free fix for a reported problem.
All of these have been in linux-next for a while with no reported
problems"
* tag 'staging-5.17-rc8' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging:
staging: rtl8723bs: Improve the comment explaining the locking rules
staging: rtl8723bs: Fix access-point mode deadlock
staging: gdm724x: fix use after free in gdm_lte_rx()
Miaoqian Lin [Thu, 10 Mar 2022 01:53:13 +0000 (01:53 +0000)]
gianfar: ethtool: Fix refcount leak in gfar_get_ts_info
The of_find_compatible_node() function returns a node pointer with
refcount incremented, We should use of_node_put() on it when done
Add the missing of_node_put() to release the refcount.
Fixes: 7349a74ea75c ("net: ethernet: gianfar_ethtool: get phc index through drvdata") Signed-off-by: Miaoqian Lin <linmq006@gmail.com> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com> Reviewed-by: Claudiu Manoil <claudiu.manoil@nxp.com> Link: https://lore.kernel.org/r/20220310015313.14938-1-linmq006@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Linus Torvalds [Thu, 10 Mar 2022 19:43:01 +0000 (11:43 -0800)]
Merge tag 'soc-fixes-5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
Pull ARM SoC fixes from Arnd Bergmann:
"Here is a third set of fixes for the soc tree, well within the
expected set of changes.
Maintainer list changes:
- Krzysztof Kozlowski and Jisheng Zhang both have new email addresses
- Broadcom iProc has a new git tree
Regressions:
- Robert Foss sends a revert for a Mediatek DPI bridge patch that
caused an inadvertent break in the DT binding
- mstar timers need to be included in Kconfig
Devicetree fixes for:
- Aspeed ast2600 spi pinmux
- Tegra eDP panels on Nyan FHD
- Tegra display IOMMU
- Qualcomm sm8350 UFS clocks
- minor DT changes for Marvell Armada, Qualcomm sdx65, Qualcomm
sm8450, and Broadcom BCM2711"
* tag 'soc-fixes-5.17-3' of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc:
arm64: dts: marvell: armada-37xx: Remap IO space to bus address 0x0
MAINTAINERS: Update Jisheng's email address
Revert "arm64: dts: mt8183: jacuzzi: Fix bus properties in anx's DSI endpoint"
dt-bindings: drm/bridge: anx7625: Revert DPI support
ARM: dts: aspeed: Fix AST2600 quad spi group
MAINTAINERS: update Krzysztof Kozlowski's email
MAINTAINERS: Update git tree for Broadcom iProc SoCs
ARM: tegra: Move Nyan FHD panels to AUX bus
arm64: dts: armada-3720-turris-mox: Add missing ethernet0 alias
ARM: mstar: Select HAVE_ARM_ARCH_TIMER
soc: mediatek: mt8192-mmsys: Fix dither to dsi0 path's input sel
arm64: dts: mt8183: jacuzzi: Fix bus properties in anx's DSI endpoint
ARM: boot: dts: bcm2711: Fix HVS register range
arm64: dts: qcom: c630: disable crypto due to serror
arm64: dts: qcom: sm8450: fix apps_smmu interrupts
arm64: dts: qcom: sm8450: enable GCC_USB3_0_CLKREF_EN for usb
arm64: dts: qcom: sm8350: Correct UFS symbol clocks
arm64: tegra: Disable ISO SMMU for Tegra194
Revert "dt-bindings: arm: qcom: Document SDX65 platform and boards"
Linus Torvalds [Tue, 8 Mar 2022 19:55:48 +0000 (11:55 -0800)]
mm: gup: make fault_in_safe_writeable() use fixup_user_fault()
Instead of using GUP, make fault_in_safe_writeable() actually force a
'handle_mm_fault()' using the same fixup_user_fault() machinery that
futexes already use.
Using the GUP machinery meant that fault_in_safe_writeable() did not do
everything that a real fault would do, ranging from not auto-expanding
the stack segment, to not updating accessed or dirty flags in the page
tables (GUP sets those flags on the pages themselves).
The latter causes problems on architectures (like s390) that do accessed
bit handling in software, which meant that fault_in_safe_writeable()
didn't actually do all the fault handling it needed to, and trying to
access the user address afterwards would still cause faults.
Jisheng Zhang [Thu, 10 Feb 2022 16:49:43 +0000 (00:49 +0800)]
riscv: alternative only works on !XIP_KERNEL
The alternative mechanism needs runtime code patching, it can't work
on XIP_KERNEL. And the errata workarounds are implemented via the
alternative mechanism. So add !XIP_KERNEL dependency for alternative
and erratas.
Arnd Bergmann [Thu, 10 Mar 2022 14:25:45 +0000 (15:25 +0100)]
Merge tag 'mvebu-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu into arm/fixes
mvebu fixes for 5.17 (part 2)
Allow using old PCIe card on Armada 37xx
* tag 'mvebu-fixes-5.17-2' of git://git.kernel.org/pub/scm/linux/kernel/git/gclement/mvebu:
arm64: dts: marvell: armada-37xx: Remap IO space to bus address 0x0
Pali Rohár [Thu, 10 Mar 2022 10:39:23 +0000 (11:39 +0100)]
arm64: dts: marvell: armada-37xx: Remap IO space to bus address 0x0
Legacy and old PCI I/O based cards do not support 32-bit I/O addressing.
Since commit 64f160e19e92 ("PCI: aardvark: Configure PCIe resources from
'ranges' DT property") kernel can set different PCIe address on CPU and
different on the bus for the one A37xx address mapping without any firmware
support in case the bus address does not conflict with other A37xx mapping.
So remap I/O space to the bus address 0x0 to enable support for old legacy
I/O port based cards which have hardcoded I/O ports in low address space.
Note that DDR on A37xx is mapped to bus address 0x0. And mapping of I/O
space can be set to address 0x0 too because MEM space and I/O space are
separate and so do not conflict.
Remapping IO space on Turris Mox to different address is not possible to
due bootloader bug.
The kernel test robot discovered that building without
HARDEN_BRANCH_PREDICTOR issues a warning due to a missing
argument to pr_info().
Add the missing argument.
Reported-by: kernel test robot <lkp@intel.com> Fixes: 9dd78194a372 ("ARM: report Spectre v2 status through sysfs") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Linus Torvalds [Thu, 10 Mar 2022 11:55:33 +0000 (03:55 -0800)]
Merge tag 'gpio-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:
- fix a probe failure for Tegra241 GPIO controller in gpio-tegra186
- revert changes that caused a regression in the sysfs user-space
interface
- correct the debounce time conversion in GPIO ACPI
- statify a struct in gpio-sim and fix a typo
- update registers in correct order (hardware quirk) in gpio-ts4900
* tag 'gpio-fixes-for-v5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
gpio: sim: fix a typo
gpio: ts4900: Do not set DAT and OE together
gpio: sim: Declare gpio_sim_hog_config_item_ops static
gpiolib: acpi: Convert ACPI value of debounce to microseconds
gpio: Revert regression in sysfs-gpio (gpiolib.c)
gpio: tegra186: Add IRQ per bank for Tegra241
Mark Featherston [Thu, 10 Mar 2022 01:16:16 +0000 (17:16 -0800)]
gpio: ts4900: Do not set DAT and OE together
This works around an issue with the hardware where both OE and
DAT are exposed in the same register. If both are updated
simultaneously, the harware makes no guarantees that OE or DAT
will actually change in any given order and may result in a
glitch of a few ns on a GPIO pin when changing direction and value
in a single write.
Setting direction to input now only affects OE bit. Setting
direction to output updates DAT first, then OE.
Linus Torvalds [Thu, 10 Mar 2022 04:58:29 +0000 (20:58 -0800)]
Merge tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux
Pull clk fixes from Stephen Boyd:
"One more small batch of clk driver fixes:
- A fix for the Qualcomm GDSC power domain delays that avoids black
screens at boot on some more recent SoCs that use a different delay
than the hard-coded delays in the driver.
- A build fix LAN966X clk driver that let it be built on
architectures that didn't have IOMEM"
* tag 'clk-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/clk/linux:
clk: lan966x: Fix linking error
clk: qcom: dispcc: Update the transition delay for MDSS GDSC
clk: qcom: gdsc: Add support to update GDSC transition delay
Linus Torvalds [Thu, 10 Mar 2022 04:44:17 +0000 (20:44 -0800)]
Merge tag 'xsa396-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip
Pull xen fixes from Juergen Gross:
"Several Linux PV device frontends are using the grant table interfaces
for removing access rights of the backends in ways being subject to
race conditions, resulting in potential data leaks, data corruption by
malicious backends, and denial of service triggered by malicious
backends:
- blkfront, netfront, scsifront and the gntalloc driver are testing
whether a grant reference is still in use. If this is not the case,
they assume that a following removal of the granted access will
always succeed, which is not true in case the backend has mapped
the granted page between those two operations.
As a result the backend can keep access to the memory page of the
guest no matter how the page will be used after the frontend I/O
has finished. The xenbus driver has a similar problem, as it
doesn't check the success of removing the granted access of a
shared ring buffer.
- blkfront, netfront, scsifront, usbfront, dmabuf, xenbus, 9p,
kbdfront, and pvcalls are using a functionality to delay freeing a
grant reference until it is no longer in use, but the freeing of
the related data page is not synchronized with dropping the granted
access.
As a result the backend can keep access to the memory page even
after it has been freed and then re-used for a different purpose.
- netfront will fail a BUG_ON() assertion if it fails to revoke
access in the rx path.
This will result in a Denial of Service (DoS) situation of the
guest which can be triggered by the backend"
* tag 'xsa396-5.17-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip:
xen/netfront: react properly to failing gnttab_end_foreign_access_ref()
xen/gnttab: fix gnttab_end_foreign_access() without page specified
xen/pvcalls: use alloc/free_pages_exact()
xen/9p: use alloc/free_pages_exact()
xen/usb: don't use gnttab_end_foreign_access() in xenhcd_gnttab_done()
xen: remove gnttab_query_foreign_access()
xen/gntalloc: don't use gnttab_query_foreign_access()
xen/scsifront: don't use gnttab_query_foreign_access() for mapped status
xen/netfront: don't use gnttab_query_foreign_access() for mapped status
xen/blkfront: don't use gnttab_query_foreign_access() for mapped status
xen/grant-table: add gnttab_try_end_foreign_access()
xen/xenbus: don't let xenbus_grant_ring() remove grants in error case
====================
selftests: pmtu.sh: Fix cleanup of processes launched in subshell.
Depending on the options used, pmtu.sh may launch tcpdump and nettest
processes in the background. However it fails to clean them up after
the tests complete.
Patch 1 allows the cleanup() function to read the list of PIDs launched
by the tests.
Patch 2 fixes the way the nettest PIDs are retrieved.
====================