This is variant of Fast search of some nibbles in two ints at same offset (C, microoptimisation) question with different task:
The task is to find a predefined nibble in int32 and replace it with other nibble. For example, nibble to search is 0x5; nibble to replace with is 0xe:
int: 0x3d542753 (input)
^ ^
output:0x3dE427E3 (output int)
There can be other pair of nibble to search and nibble to replace (known at compile time).
I checked my program, this part is one of most hot place (gprof proven, 75% of time is in the function); and it is called a very-very many times (gcov proven). Actually it is th开发者_如何学运维e 3rd or 4th loop of nested loops, with run count estimation of (n^3)*(2^n), for n=18..24.
My current code is slow (I rewrite it as function, but it is a code from loop):
static inline uint32_t nibble_replace (uint32_t A) __attribute__((always_inline))
{
int i;
uint32_t mask = 0xf;
uint32_t search = 0x5;
uint32_t replace = 0xe;
for(i=0;i<8;i++) {
if( (A&mask) == search )
A = (A & (~mask) ) // clean i-th nibble
| replace; // and replace
mask <<= 4; search <<= 4; replace <<= 4;
}
return A;
}
Is it possible to rewrite this function and macro in parallel way, using some bit logic magic? Magic is something like (t-0x11111111)&(~t)-0x88888888
and possibly usable with SSE*. Check the accepted answer of linked question to get feeling about needed magic.
My compiler is gcc452 and cpu is Intel Core2 Solo in 32bit mode (x86) or (in near future) in 64bit mode (x86-64).
This seemed like a fun question, so I wrote a solution without looking at other answers. This appears to be about 4.9x as fast on my system. On my system, it's also slightly faster than DigitalRoss's solution (~25% faster).
static inline uint32_t nibble_replace_2(uint32_t x)
{
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
y *= 15; /* This is faster than y |= y << 1; y |= y << 2; */
return x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}
I would explain how it works, but... I think explaining it spoils the fun.
Note on SIMD: This kind of stuff is very, very easy to vectorize. You don't even have to know how to use SSE or MMX. Here is how I vectorized it:
static void nibble_replace_n(uint32_t *restrict p, uint32_t n)
{
uint32_t i;
for (i = 0; i < n; ++i) {
uint32_t x = p[i];
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
y *= 15;
p[i] = x ^ (((SEARCH ^ REPLACE) * ONES) & y);
}
}
Using GCC, this function will automatically be converted to SSE code at -O3
, assuming proper use of the -march
flag. You can pass -ftree-vectorizer-verbose=2
to GCC to ask it to print out which loops are vectorized, e.g.:
$ gcc -std=gnu99 -march=native -O3 -Wall -Wextra -o opt opt.c
opt.c:66: note: LOOP VECTORIZED.
Automatic vectorization gave me an extra speed gain of about 64%, and I didn't even have to reach for the processor manual.
Edit: I noticed an additional 48% speedup by changing the types in the auto-vectorized version from uint32_t
to uint16_t
. This brings the total speedup to about 12x over the original. Changing to uint8_t
causes vectorization to fail. I suspect there's some significant extra speed to be found with hand assembly, if it's that important.
Edit 2: Changed *= 7
to *= 15
, this invalidates the speed tests.
Edit 3: Here's a change that is obvious in retrospect:
static inline uint32_t nibble_replace_2(uint32_t x)
{
uint32_t SEARCH = 0x5, REPLACE = 0xE, ONES = 0x11111111;
uint32_t y = (~(ONES * SEARCH)) ^ x;
y &= y >> 2;
y &= y >> 1;
y &= ONES;
return x ^ (y * (SEARCH ^ REPLACE));
}
精彩评论