git.baikalelectronics.ru Git - kernel.git/commit

author	Clement Courbet <courbet@google.com>
	Wed, 3 Mar 2021 22:46:53 +0000 (14:46 -0800)
committer	Peter Zijlstra <peterz@infradead.org>
	Wed, 10 Mar 2021 08:51:49 +0000 (09:51 +0100)
commit	af7c7bf146b1f9a7219f7e36ad119282dec51958
tree	65052ced18c491a4a3b229f5dc780b94efa3f063
parent	a440e9f121f0f65fdc056ba41f0bbd1edfd16265

sched: Optimize __calc_delta()

A significant portion of __calc_delta() time is spent in the loop that
shifts a u64 right one bit at a time until it fits in 32 bits. Use `fls`
to compute the shift in one step instead of iterating.
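
As a rough illustration of the idea (a sketch, not the actual kernel code),
the two approaches can be written in plain C. The names `sketch_fls()`,
`reduce_loop()` and `reduce_fls()` are hypothetical; `sketch_fls()` is assumed
to behave like the kernel's `fls()`, returning the 1-based index of the
highest set bit, or 0 for a zero input:

```c
#include <stdint.h>

/* Hypothetical stand-in for fls(): 1-based index of the most
 * significant set bit of x, or 0 if x == 0 (binary search, no loop). */
static int sketch_fls(uint32_t x)
{
	int r = 32;

	if (!x)
		return 0;
	if (!(x & 0xffff0000u)) { x <<= 16; r -= 16; }
	if (!(x & 0xff000000u)) { x <<= 8;  r -= 8;  }
	if (!(x & 0xf0000000u)) { x <<= 4;  r -= 4;  }
	if (!(x & 0xc0000000u)) { x <<= 2;  r -= 2;  }
	if (!(x & 0x80000000u)) { r -= 1; }
	return r;
}

/* Loop variant: shift right one bit per iteration until the value
 * fits in 32 bits, decrementing *shift each time. */
static uint64_t reduce_loop(uint64_t fact, int *shift)
{
	while (fact >> 32) {
		fact >>= 1;
		(*shift)--;
	}
	return fact;
}

/* fls variant: the number of iterations the loop would need equals
 * fls() of the upper 32 bits, so apply the whole shift at once. */
static uint64_t reduce_fls(uint64_t fact, int *shift)
{
	uint32_t hi = (uint32_t)(fact >> 32);

	if (hi) {
		int fs = sketch_fls(hi);

		*shift -= fs;
		fact >>= fs;
	}
	return fact;
}
```

Both helpers should return the same reduced value and leave `*shift` in the
same state for any input; only the number of operations differs.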

This is ~7x faster on benchmarks.

The generic `fls` implementation (`generic_fls`) is still ~4x faster
than the loop.
Architectures that have a better implementation will make use of it. For
example, on x86 we get an additional factor 2 in speed without dedicated
implementation.

On GCC, the asm versions of `fls` are about the same speed as the
builtin. On Clang, the versions that use the asm `fls` are more than twice
as slow as the builtin. This is because of the way the `fls` function is
written: clang puts the value in memory, see
https://godbolt.org/z/EfMbYe. This bug is filed at
https://bugs.llvm.org/show_bug.cgi?id=49406.
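
For reference, a builtin-based `fls` might look like the sketch below. It
assumes a GCC- or Clang-compatible compiler that provides `__builtin_clz`,
and `builtin_fls_sketch()` is a hypothetical name rather than the kernel's
own helper:

```c
#include <stdint.h>

/* Hypothetical builtin-based fls: 1-based index of the highest set
 * bit, or 0 for x == 0. */
static int builtin_fls_sketch(uint32_t x)
{
	/* __builtin_clz(0) is undefined, so zero is handled explicitly. */
	return x ? 32 - __builtin_clz(x) : 0;
}
```

Because the value never has to pass through an inline-asm operand, the
compiler is free to keep it in a register.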

```
name                                   cpu/op
BM_Calc<__calc_delta_loop>             9.57ms ±12%
BM_Calc<__calc_delta_generic_fls>      2.36ms ±13%
BM_Calc<__calc_delta_asm_fls>          2.45ms ±13%
BM_Calc<__calc_delta_asm_fls_nomem>    1.66ms ±12%
BM_Calc<__calc_delta_asm_fls64>        2.46ms ±13%
BM_Calc<__calc_delta_asm_fls64_nomem>  1.34ms ±15%
BM_Calc<__calc_delta_builtin>          1.32ms ±11%
```

Signed-off-by: Clement Courbet <courbet@google.com>
Signed-off-by: Josh Don <joshdon@google.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20210303224653.2579656-1-joshdon@google.com

kernel/sched/fair.c
kernel/sched/sched.h