int u1, u2;
unsigned long elm1[20], _mulpre[16][20], res1[40], res2[40]; 64 bits long
res1, res2 initialized to zero.
l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
simda = _mm_load_si128 ((__m128i *) &_mulpre[u1][k]);
simdb = _mm_load_si128 ((__m128i *) &res1[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res1[i + k], simdb);
simda = _mm_load_si128 ((__m128i *)&_mulpre[u2][k]);
sim开发者_运维知识库db = _mm_load_si128 ((__m128i *)&res2[i + k]);
simdb = _mm_xor_si128 (simda, simdb);
_mm_store_si128 ((__m128i *)&res2[i + k], simdb);
}
}
l -= 4;
All res1, res2 values are left shifted by 4 bits.
}
The above mentioned code is called many times in my program (profiler shows 98%).
EDIT: In the inner loop, res1[i + k] values are loaded many times for same (i + k) values. I tried with this inside the while loop, I loaded all the res1 values into simd registers (array) and use array elements inside the innermost for loop to update array elements . Once both for loops are done, I stored the array values back to the res1, re2. But computation time increases with this. Any idea where I got wrong? The idea seemed to be correct
Any suggestion to make it faster is welcome.
Unfortunately the most obvious optimisations are probably already being done by the compiler:
- You can pull
&_mulpre[u1]
and&mulpre[u2]
our of the inner loop. - You can pull
&res1[i]
our of the inner loop. - Using different variables for the two inner operations, and reordering them, might allow for better pipelining.
Possibly swapping the outer loops would improve cache locality on elm1
.
Well, you could always call it fewer times :-)
The total input & output data looks relatively small, depending on you design and expected input it might be feasible to just cache computations or do lazy evaluation instead of up-front.
There is very little you can do with a routine such as this, since loads and stores will be the dominant factor (you're doing 2 loads + 1 store = 4 bus cycles for a single computational instruction).
l = 60;
while (l)
{
for (i = 0; i < 20; i += 2)
{
u1 = (elm1[i] >> l) & 15;
u2 = (elm1[i + 1] >> l) & 15;
for (k = 0; k < 20; k += 2)
{
_mm_stream_si128 ((__m128i *)&res1[i + k],
_mm_xor_si128 (
_mm_load_si128 ((__m128i *) &_mulpre[u1][k]),
_mm_load_si128 ((__m128i *) &res1[i + k]
));
mm_stream_si128 ((__m128i *)&res2[i + k],
_mm_xor_si128 (
_mm_load_si128 ((__m128i *)&_mulpre[u2][k]),
_mm_load_si128 ((__m128i *)&res2[i + k])
));
}
}
l -= 4;
All res1, res2 values are left shifted by 4 bits.
}
- Do remember your are using intrinsic, using less _128mi/_mm128 value will speed up your program.
- try _mm_stream_si128(), it might speed up the storing process.
- try prefetch
精彩评论