Generally everything I come across 'on-the-net' with relation to SSE/MMX comes out as maths stuff for vectors and matracies. However, I'm looking for libraries of SSE optimized 'standard functions', like those provided by Agner Fog, or some of the SSE based string scanning algorithms in GCC.
As a quick general rundown: these would be things like memset, memcpy, strstr, memcmp BSR/BSF, ie an stdlib-esque built from SSE intrsuctions
I'd preferably like them to be for SSE1 (formally MMX2) using intrinsics rather than assembly, but either is fine. hopefully this not too broad a spectrum.
Upda开发者_JAVA技巧te 1
I came across some promising stuff after some searching, one library caught my eye:
- LibFreeVec: seems mac/IBM only (due to being AltiVec based), thus of little use(to me), plus I can't seem to find a direct download link, nor does it state the minimum supported SSE version
I also came across an article on a few vectorised string functions(strlen, strstr strcmp). However SSE4.2 is way out of my reach (as said before, I'd like to stick to SSE1/MMX).
Update 2
Paul R motivated me to do a little benchmarking, unfortunately as my SSE assembly coding experience is close to zip, I used someone else's (http://www.mindcontrol.org/~hplus/) benchmarking code and added to it. All tests(excluding the original, which is VC6 SP5) where compiled under VC9 SP1 with full/customized optimizations and /arch:SSE
on.
First test was one my home machine (AMD Sempron 2200+ 512mb DDR 333), capped at SSE1 (thus no vectorization by MSVC memcpy):
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 1494.0 MHz
size SSE Cycles thru-sse memcpy Cycles thru-memcpy asm Cycles thru-asm
1 kB 2879 506.75 MB/s 4132 353.08 MB/s 2655 549.51 MB/s
2 kB 4877 598.29 MB/s 7041 414.41 MB/s 5179 563.41 MB/s
4 kB 8890 656.44 MB/s 13123 444.70 MB/s 9832 593.55 MB/s
8 kB 17413 670.28 MB/s 25128 464.48 MB/s 19403 601.53 MB/s
16 kB 34569 675.26 MB/s 48227 484.02 MB/s 38303 609.43 MB/s
32 kB 68992 676.69 MB/s 95582 488.44 MB/s 75969 614.54 MB/s
64 kB 138637 673.50 MB/s 195012 478.80 MB/s 151716 615.44 MB/s
128 kB 277678 672.52 MB/s 400484 466.30 MB/s 304670 612.94 MB/s
256 kB 565227 660.78 MB/s 906572 411.98 MB/s 618394 603.97 MB/s
512 kB 1142478 653.82 MB/s 1936657 385.70 MB/s 1380146 541.23 MB/s
1024 kB 2268244 658.64 MB/s 3989323 374.49 MB/s 2917758 512.02 MB/s
2048 kB 4556890 655.69 MB/s 8299992 359.99 MB/s 6166871 484.51 MB/s
4096 kB 9307132 642.07 MB/s 16873183 354.16 MB/s 12531689 476.86 MB/s
full tests
Second test batch was done on a university workstation(Intel E6550, 2.33Ghz, 2gb DDR2 800?)
VC9 SSE/memcpy/ASM:
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 2327.2 MHz
size SSE Cycles thru-sse memcpy Cycles thru-memcpy asm Cycles thru-asm
1 kB 392 5797.69 MB/s 434 5236.63 MB/s 420 5411.18 MB/s
2 kB 882 5153.51 MB/s 707 6429.13 MB/s 714 6366.10 MB/s
4 kB 2044 4447.55 MB/s 1218 7463.70 MB/s 1218 7463.70 MB/s
8 kB 3941 4613.44 MB/s 2170 8378.60 MB/s 2303 7894.73 MB/s
16 kB 7791 4667.33 MB/s 4130 8804.63 MB/s 4410 8245.61 MB/s
32 kB 15470 4701.12 MB/s 7959 9137.61 MB/s 8708 8351.66 MB/s
64 kB 30716 4735.40 MB/s 15638 9301.22 MB/s 17458 8331.57 MB/s
128 kB 61019 4767.45 MB/s 31136 9343.05 MB/s 35259 8250.52 MB/s
256 kB 122164 4762.53 MB/s 62307 9337.80 MB/s 72688 8004.21 MB/s
512 kB 246302 4724.36 MB/s 129577 8980.15 MB/s 142709 8153.80 MB/s
1024 kB 502572 4630.66 MB/s 332941 6989.95 MB/s 290528 8010.38 MB/s
2048 kB 1105076 4211.91 MB/s 1384908 3360.86 MB/s 662172 7029.11 MB/s
4096 kB 2815589 3306.22 MB/s 4342289 2143.79 MB/s 2172961 4284.00 MB/s
full tests
As can be seen, SSE is very fast on my home system, but falls on the intel machine (probably due to bad coding?). my x86 assembly variant comes in second on my home machine, and second on the intel system (but the results look a bit inconsistent, one hug blocks it dominates the SSE1 version). the MSVC memcpy wins the intel system tests hands done, this is due to SSE2 vectorization though, on my home machine, it fails dismally, even the horrible __movsd
beats it...
pitfalls: the memory was all aligned powers of 2. cache was (hopefully) flushed. rdtsc was used for timing.
points of interest: MSVC has an (unlisted in any reference) __movsd intrinsic, it outputs the same assembly code I'm using, but it fails dismally(even when inlined!). That's probably why its unlisted.
VC9 memcpy can be forced to vectorize on my non-sse 2 machine, it will however corrupt the FPU stack, it also seems to have a bug.
This is the full source to what I used to test (including my alterations, again, credit to http://www.mindcontrol.org/~hplus/ for the original). The binaries an project files are available on request.
In conclusion, it seems a switching variant might be the best, similar to the MSVC crt one, only a lot more sturdy with more options and single once-off checks (via inline'd function pointers? or something more devious like internal direct call patch), however inlining would probably have to use a best case method instead
Update 3
A question asked by Eshan reminded of something useful and related to this, although only for bit sets and bit ops, BitMagic and be quite useful for large bit sets, it even has a nice article on SSE2 (bit) optimization. Unfortunatly, this still isn't a CRT/stdlib esque type library. its seems most of these projects are dedicated to a specific, small section (of problems).
This raises the question, would it then rather be worth will to create a open-source, probably multi-platform performance crt/stdlib project, creating various versions of the standardised functions, each optimized for certain situation as well as a 'best-case'/general use variant of the function, with either runtime branching for scalar/MMX/SSE/SSE2+ (à la MSVC) or a forced compile time scalar/SIMD swich.
This could be useful for HPC, or projects where every bit of performance counts (like games), freeing the programmer from worrying about the speed of the inbuilt functions, only requiring a small bit of tuning to find the optimal optimized variant.
Update 4
I think the nature of this question should be expanded, to include techniques that can be applied using SSE/MMX to optimization for non-vector/matrix applications, this could probably be used for 32/64bit scalar code as well. A good example is how to check for the occurence of a byte in a given 32/64/128/256 bit data type, at once using scalar techniques(bit manip), MMX & SSE/SIMD
Also, I see a lot of answers along the lines of "just use ICC", and thats a good answer, it not my kinda of answer, as firstly, ICC is not something I can use continuously (unless Intel have a free student version for windows), due to the 30 trial. secondly, and more pertinently, I'm not only after the libraries its self, but the techniques used to optimize/create the functions they contain, for my personal eddification and improvement, and so I can apply such techniques and principles to my own code (where needed), in conjunction with the use of these libraries. hopefully that clears up that part :)
Here's an article on how to use SIMD instructions to vectorize the counting of characters:
http://porg.es/blog/ridiculous-utf-8-character-counting
For simple operations such as memset, memcpy, etc, where there is very little computation, there is little point in SIMD optimisation, since memory bandwidth will usually be the limiting factor.
Maybe libSIMDx86?
http://simdx86.sourceforge.net
You can use the apple's or OpenSolaris's libc. These libc implementations contain what you are looking for. I was looking for these kind of things some 6 years back and I had to painfully write it the hard-way.
Ages ago I remember following a coding contest called 'fastcode' project. They did some awesome ground breaking optimisation for that time using Delphi. See their results page. Since it is written in Pascal's fast function call-model (copying arguments to registers) converting to C styled stdc function call-models (pushing on stack) may be a bit awkward. This project has no updates since a long-time especially, no code is written for SSE4.2.
Solaris -> src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/
Apple -> www.opensource.apple.com/source/Libc/Libc-594.9.1/
Here's a fast memcpy implementation in C that can replace the standard library version of memcpy if necessary:
http://www.danielvik.com/2010/02/fast-memcpy-in-c.html
Honestly, what I would do is just install the Intel C++ Compiler and learn the various automated SIMD optimization flags available. We've had very good experience optimizing code performance by simply compiling it with ICC.
Keep in mind that the entire STL library is basically just header files, so the whole thing is compiled into your exe/lib/dll, and as such can be optimized however you like.
ICC has many options and lets you specify (at the simplest) which SSE levels to target. You can also use it to generate a binary file with multiple code paths, such that if the optimal SSE configuration you compiled against isn't available, it'll run a different set of (still optimized) code configured for a less capable SIMD CPU.
strstr is hard to optimize because (a) \0-termination means you have to read every byte anyway, and (b) it has to be good on all the edge cases, too.
With that said, you can beat standard strstr by a factor of 10, using SSE2 ops. I've noticed that gcc 4.4 uses these ops for strlen now, but not for the other string ops. More on how to use SSE2 registers for strlen, strchr, strpbrk, etc. at mischasan.wordpress.com. Pardon my super-terse code layout.
#include <emmintrin.h> // Other standard #includes you can figure out...
static inline unsigned under(unsigned x)
{ return (x - 1) & ~x; }
static inline __m128i xmfill(char b)
{ return _mm_set1_epi8(b); }
static inline __m128i xmload(void const*p)
{ return _mm_load_si128((__m128i const*)p); }
static inline unsigned xmatch(__m128i a, __m128i b)
{ return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)); }
char const *sse_strstr(char const *tgt, char const *pat)
{
unsigned len = sse_strlen(pat);
if (len == 0) return tgt;
if (len == 1) return sse_strchr(tgt,*pat);
__m128i x, zero = {};
__m128i p0 = _m_set1_epi8(pat[0]), p1 = _m_set1_epi8(pat[1]);
uint16_t pair = *(uint16_t const*)pat;
unsigned z, m, f = 15 & (uintptr_t)tgt;
char const* p;
// Initial unaligned chunk of tgt:
if (f) {
z = xmatch(x = xmload(tgt - f), zero) >> f;
m = under(z) & ((xmatch(x,p0) & (xmatch(x,p1) >> 1)) >> f);
for (; m; m &= m - 1)
if (!memcmp((p = tgt+ffs(m)-1)+2, pat+2, len-2))
return p;
if (z)
return NULL;
tgt += 16 - f;
if (*(uint16_t const*)(tgt - 1) == pair
&& !memcmp(tgt+1, pat+2, len-2))
return tgt - 1;
}
// 16-byte aligned chunks of tgt:
while (!(z = xmatch(x = xmload(tgt), zero))) {
m = xmatch(x,p0) & (xmatch(x,p1) >> 1);
for (; m; m &= m - 1)
if (!memcmp((p = tgt+ffs(m)-1)+2, pat+2, len-2))
return p;
tgt += 16;
if (*(uint16_t const*)(tgt - 1) == pair && !memcmp(tgt+1, pat+2, len-2))
return tgt - 1;
}
// Final 0..15 bytes of tgt:
m = under(z) & xmatch(x,p0) & (xmatch(x,p1) >> 1);
for (; m; m &= m - 1)
if (!memcmp((p = tgt+ffs(m)-1)+2, pat+2, len-2))
return p;
return NULL;
}
I personally wouldn't bother trying to write super-optimized versions of libc functions trying to handle every possible scenario with good performance.
Instead, write optimized versions for specific situations, where you know enough about the problem at hand to write proper code... and where it matters. There's a semantic difference between memset
and ClearLargeBufferCacheWriteThrough
.
精彩评论