author	Ard Biesheuvel <ard.biesheuvel@linaro.org>
	Tue, 27 Nov 2018 17:42:55 +0000 (18:42 +0100)
committer	Will Deacon <will.deacon@arm.com>
	Fri, 30 Nov 2018 13:58:04 +0000 (13:58 +0000)
commit	10e194454427e646295507f7123929aca378fa49
tree	fca2b4c8aa400a78212468e40211689723bd6957
parent	f2f49dcf432b0d9cfcddc80b98b18d5d5e31c752

arm64/lib: improve CRC32 performance for deep pipelines

Improve the performance of the crc32() asm routines by getting rid of
most of the branches and small-sized loads on the common path.

Instead, use a branchless code path involving overlapping 16-byte
loads to process the first (length % 32) bytes, and handle the
remainder in a loop that consumes 32 bytes per iteration.
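
The length split described above can be modelled in portable C. This is
only a sketch of the head/loop decomposition, not the asm routine: it
does not reproduce the overlapping 16-byte loads or the branchless head,
and the names crc32_le_bytes() and crc32_le_split() are hypothetical
helpers introduced for illustration. It assumes the reflected CRC32
polynomial 0xEDB88320 with no pre- or post-inversion of the CRC value.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical reference helper: bit-at-a-time reflected CRC32 update
 * (polynomial 0xEDB88320), without pre- or post-inversion.
 */
static uint32_t crc32_le_bytes(uint32_t crc, const unsigned char *p, size_t len)
{
    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
    }
    return crc;
}

/*
 * Hypothetical model of the split: consume the first (len % 32) bytes,
 * then walk the rest of the buffer 32 bytes per iteration. After the
 * head has been consumed, the remaining length is a multiple of 32.
 */
static uint32_t crc32_le_split(uint32_t crc, const unsigned char *p, size_t len)
{
    size_t head = len % 32;

    crc = crc32_le_bytes(crc, p, head);
    p += head;
    len -= head;

    for (; len != 0; len -= 32, p += 32)
        crc = crc32_le_bytes(crc, p, 32);

    return crc;
}
```

Because the head absorbs every length that is not a multiple of 32, the
main loop needs no tail handling, which is what keeps the common path
free of length-dependent branches in the asm version.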

Tested using the following test program:

  #include <stdlib.h>

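  /* same signature as the kernel's u32 crc32_le(u32, unsigned char const *, size_t) */
  extern unsigned int crc32_le(unsigned int, unsigned char const *, size_t);

  int main(void)
  {
    static const unsigned char buf[4096];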

    srand(20181126);

    for (int i = 0; i < 100 * 1000 * 1000; i++)
      crc32_le(0, buf, rand() % 1024);

    return 0;
  }

On Cortex-A53 and Cortex-A57, performance regresses, but only very
slightly. On Cortex-A72, however, the run time improves from

  $ time ./crc32

  real  0m10.149s
  user  0m10.149s
  sys   0m0.000s

to

  $ time ./crc32

  real  0m7.915s
  user  0m7.915s
  sys   0m0.000s

Cc: Rui Sun <sunrui26@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>