slub bulk alloc: extract objects from the per cpu slab
First piece: accelerate the retrieval of per-CPU objects.
If we are allocating lots of objects then it is advantageous to disable
interrupts and avoid the this_cpu_cmpxchg() operation to get these objects
faster.
Note that we cannot do the fast operation if debugging is enabled, because
we would have to add extra code to do all the debugging checks, and then it
would not be fast anyway.
Note also that the requirement of having interrupts disabled avoids having
to do processor flag operations.
Allocate as many objects as possible in the fast way and then fall back to
the generic implementation for the rest of the objects.
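
The flow described above can be illustrated with a small userspace
simulation. This is only a sketch under stated assumptions, not the kernel
code: struct object_sim, struct cache_sim, link_objects(),
generic_alloc_sim() and bulk_alloc_sim() are hypothetical names, there are no
real interrupts or per-CPU data, and the fast_hits/slow_hits counters exist
only to make the two paths visible.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical free object: the first word links to the next free object. */
struct object_sim {
    struct object_sim *next;
    char payload[24];
};

/* Hypothetical stand-in for a cache; not the kernel's struct kmem_cache. */
struct cache_sim {
    struct object_sim *cpu_freelist;  /* models the per-CPU slab freelist */
    struct object_sim *generic_pool;  /* models whatever the generic path uses */
    int debug;                        /* nonzero: debugging on, no fast path */
    size_t fast_hits;                 /* objects taken from cpu_freelist */
    size_t slow_hits;                 /* objects taken via the generic path */
};

/* Chain an array of objects into a NULL-terminated freelist. */
struct object_sim *link_objects(struct object_sim *objs, size_t count)
{
    for (size_t i = 0; i < count; i++)
        objs[i].next = (i + 1 < count) ? &objs[i + 1] : NULL;
    return count ? &objs[0] : NULL;
}

/* Stand-in for the generic implementation: one object per call. */
struct object_sim *generic_alloc_sim(struct cache_sim *s)
{
    struct object_sim *obj = s->generic_pool;

    if (obj) {
        s->generic_pool = obj->next;
        s->slow_hits++;
    }
    return obj;
}

/* Returns the number of objects actually stored in out[]. */
size_t bulk_alloc_sim(struct cache_sim *s, size_t n, struct object_sim **out)
{
    size_t i = 0;

    assert(s != NULL && out != NULL);

    if (!s->debug) {
        /*
         * Fast part. The real code would run this loop with interrupts
         * disabled, so plain loads and stores on the freelist are enough
         * and no this_cpu_cmpxchg() is needed per object.
         */
        while (i < n && s->cpu_freelist) {
            struct object_sim *obj = s->cpu_freelist;

            s->cpu_freelist = obj->next;
            out[i++] = obj;
            s->fast_hits++;
        }
    }

    /* Fall back to the generic implementation for the rest. */
    while (i < n) {
        struct object_sim *obj = generic_alloc_sim(s);

        if (!obj)
            break;
        out[i++] = obj;
    }
    return i;
}
```

With three objects on the simulated per-CPU freelist and a request for five,
the first three come from the fast loop and the remaining two from the
generic path. With debug set, every object goes through the generic path.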
Measurements on CPU i7-4790K @ 4.00GHz:
Baseline normal fastpath (alloc+free cost): 42 cycles(tsc) 10.554 ns