iosys-map: Add per-word read
Instead of always falling back to memcpy_fromio() for any size, prefer
using read{b,w,l}(). When reading struct members it's common to read
individual integer variables individually. Going through memcpy_fromio()
for each of them poses a high penalty.
Employ a similar trick as __seqprop() by using _Generic() to generate
only the specific call based on a type-compatible variable.
For a pariticular i915 workload producing GPU context switches,
__get_engine_usage_record() is particularly hot since the engine usage
is read from device local memory with dgfx, possibly multiple times
since it's racy. Test execution time for this test shows a ~12.5%
improvement with DG2:
Before:
nrepeats = 1000; min = 7.63243e+06; max = 1.01817e+07;
median = 9.52548e+06; var = 526149;
After:
nrepeats = 1000; min = 7.03402e+06; max = 8.8832e+06;
median = 8.33955e+06; var = 333113;
Other things attempted that didn't prove very useful:
1) Change the _Generic() on x86 to just dereference the memory address
2) Change __get_engine_usage_record() to do just 1 read per loop,
comparing with the previous value read
3) Change __get_engine_usage_record() to access the fields directly as it
was before the conversion to iosys-map
(3) did gave a small improvement (~3%), but doesn't seem to scale well
to other similar cases in the driver.
Additional test by Chris Wilson using gem_create from igt with some
changes to track object creation time. This happens to accidentally
stress this code path:
Pre iosys_map conversion of engine busyness:
lmem0: Creating 262144 4KiB objects took 59274.2ms
Unpatched:
lmem0: Creating 262144 4KiB objects took 108830.2ms
With readl (this patch):
lmem0: Creating 262144 4KiB objects took 61348.6ms
s/readl/READ_ONCE/
lmem0: Creating 262144 4KiB objects took 61333.2ms
So we do take a little bit more time than before the conversion, but
that is due to other factors: bringing the READ_ONCE back would be as
good as just doing this conversion.
v2:
- Remove default from _Generic() - callers wanting to read more
than u64 should use iosys_map_memcpy_from()
- Add READ_ONCE() cases dereferencing the pointer when using system
memory
v3:
- Fix precedence issue when casting inside READ_ONCE(). By not using ()
around vaddr__ the offset was not part of the cast, but rather added
to it, producing a wrong address
- Remove compiletime_assert() as READ_ONCE() already contains it
Signed-off-by: Lucas De Marchi <lucas.demarchi@intel.com>
Reviewed-by: Christian König <christian.koenig@amd.com> # v1
Reviewed-by: Thomas Zimmermann <tzimmermann@suse.de>
Link: https://patchwork.freedesktop.org/patch/msgid/20220628191016.3899428-1-lucas.demarchi@intel.com