An SSE Stdlib-esque Library?


Generally everything I come across 'on-the-net' with relation to SSE/MMX comes out as maths stuff for vectors and matrices. However, I'm looking for libraries of SSE-optimized 'standard functions', like those provided by Agner Fog, or some of the SSE-based string scanning algorithms in GCC.

As a quick general rundown: these would be things like memset, memcpy, strstr, memcmp, BSR/BSF, i.e. an stdlib-esque kit built from SSE instructions.

I'd preferably like them to be for SSE1 (formerly MMX2) using intrinsics rather than assembly, but either is fine. Hopefully this is not too broad a spectrum.

Update 1

I came across some promising stuff after some searching, one library caught my eye:

  • LibFreeVec: seems Mac/IBM only (due to being AltiVec based), and thus of little use (to me); plus I can't seem to find a direct download link, nor does it state the minimum supported SSE version

I also came across an article on a few vectorised string functions (strlen, strstr, strcmp). However, SSE4.2 is way out of my reach (as said before, I'd like to stick to SSE1/MMX).

Update 2

Paul R motivated me to do a little benchmarking. Unfortunately, as my SSE assembly coding experience is close to zip, I used someone else's (http://www.mindcontrol.org/~hplus/) benchmarking code and added to it. All tests (excluding the original, which is VC6 SP5) were compiled under VC9 SP1 with full/customized optimizations and /arch:SSE on.

First test was on my home machine (AMD Sempron 2200+, 512 MB DDR 333), capped at SSE1 (thus no vectorization by MSVC's memcpy):

comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 1494.0 MHz
  size  SSE Cycles      thru-sse    memcpy Cycles   thru-memcpy     asm Cycles      thru-asm
   1 kB 2879        506.75 MB/s     4132        353.08 MB/s     2655        549.51 MB/s
   2 kB 4877        598.29 MB/s     7041        414.41 MB/s     5179        563.41 MB/s
   4 kB 8890        656.44 MB/s     13123       444.70 MB/s     9832        593.55 MB/s
   8 kB 17413       670.28 MB/s     25128       464.48 MB/s     19403       601.53 MB/s
  16 kB 34569       675.26 MB/s     48227       484.02 MB/s     38303       609.43 MB/s
  32 kB 68992       676.69 MB/s     95582       488.44 MB/s     75969       614.54 MB/s
  64 kB 138637      673.50 MB/s     195012      478.80 MB/s     151716      615.44 MB/s
 128 kB 277678      672.52 MB/s     400484      466.30 MB/s     304670      612.94 MB/s
 256 kB 565227      660.78 MB/s     906572      411.98 MB/s     618394      603.97 MB/s
 512 kB 1142478     653.82 MB/s     1936657     385.70 MB/s     1380146     541.23 MB/s
1024 kB 2268244     658.64 MB/s     3989323     374.49 MB/s     2917758     512.02 MB/s
2048 kB 4556890     655.69 MB/s     8299992     359.99 MB/s     6166871     484.51 MB/s
4096 kB 9307132     642.07 MB/s     16873183    354.16 MB/s     12531689    476.86 MB/s

full tests

Second test batch was done on a university workstation (Intel E6550, 2.33 GHz, 2 GB DDR2-800?):

VC9 SSE/memcpy/ASM:
comparing P-III SIMD copytest (blocksize 4096) to memcpy
calculated CPU speed: 2327.2 MHz
  size  SSE Cycles      thru-sse    memcpy Cycles   thru-memcpy     asm Cycles      thru-asm
   1 kB 392         5797.69 MB/s    434         5236.63 MB/s    420         5411.18 MB/s
   2 kB 882         5153.51 MB/s    707         6429.13 MB/s    714         6366.10 MB/s
   4 kB 2044        4447.55 MB/s    1218        7463.70 MB/s    1218        7463.70 MB/s
   8 kB 3941        4613.44 MB/s    2170        8378.60 MB/s    2303        7894.73 MB/s
  16 kB 7791        4667.33 MB/s    4130        8804.63 MB/s    4410        8245.61 MB/s
  32 kB 15470       4701.12 MB/s    7959        9137.61 MB/s    8708        8351.66 MB/s
  64 kB 30716       4735.40 MB/s    15638       9301.22 MB/s    17458       8331.57 MB/s
 128 kB 61019       4767.45 MB/s    31136       9343.05 MB/s    35259       8250.52 MB/s
 256 kB 122164      4762.53 MB/s    62307       9337.80 MB/s    72688       8004.21 MB/s
 512 kB 246302      4724.36 MB/s    129577      8980.15 MB/s    142709      8153.80 MB/s
1024 kB 502572      4630.66 MB/s    332941      6989.95 MB/s    290528      8010.38 MB/s
2048 kB 1105076     4211.91 MB/s    1384908     3360.86 MB/s    662172      7029.11 MB/s
4096 kB 2815589     3306.22 MB/s    4342289     2143.79 MB/s    2172961     4284.00 MB/s

full tests

As can be seen, SSE is very fast on my home system but falls behind on the Intel machine (probably due to bad coding?). My x86 assembly variant comes in second on my home machine, and second on the Intel system (though the results look a bit inconsistent; on huge blocks it dominates the SSE1 version). The MSVC memcpy wins the Intel system tests hands down, though this is due to SSE2 vectorization; on my home machine it fails dismally, and even the horrible __movsd beats it...

Pitfalls: the memory was all aligned to powers of 2, the cache was (hopefully) flushed, and rdtsc was used for timing.
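For reference, the rdtsc-based timing can be sketched like this (a sketch only; `__rdtsc` comes from `<x86intrin.h>` on GCC/Clang, `<intrin.h>` on MSVC, and the function name here is illustrative):

```c
#include <stdint.h>
#include <string.h>
#include <x86intrin.h>  // __rdtsc (GCC/Clang; on MSVC use <intrin.h>)

// Time one memcpy call in raw time-stamp-counter cycles.
// Note: rdtsc is not a serializing instruction, so a serious benchmark
// would fence it (cpuid/lfence) and average over many iterations.
static uint64_t time_memcpy_cycles(void *dst, const void *src, size_t n)
{
    uint64_t start = __rdtsc();
    memcpy(dst, src, n);
    return __rdtsc() - start;
}
```

Converting cycles to MB/s then only needs the measured CPU frequency, as in the tables above.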

Points of interest: MSVC has an __movsd intrinsic (unlisted in any reference); it outputs the same assembly code I'm using, but it fails dismally (even when inlined!). That's probably why it's unlisted.

VC9's memcpy can be forced to vectorize on my non-SSE2 machine; however, it will corrupt the FPU stack, and it also seems to have a bug.

This is the full source of what I used to test (including my alterations; again, credit to http://www.mindcontrol.org/~hplus/ for the original). The binaries and project files are available on request.

In conclusion, it seems a switching variant might be best, similar to the MSVC CRT one, only a lot sturdier, with more options and a single once-off check (via inlined function pointers? or something more devious like internal direct-call patching). Inlining, however, would probably have to use a best-case method instead.
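The once-off check via a function pointer can be sketched as a self-patching dispatcher (a sketch, not the MSVC CRT's actual mechanism; `__builtin_cpu_supports` is GCC/Clang-specific, and both "implementations" here are placeholders that just forward to the libc memcpy):

```c
#include <stddef.h>
#include <string.h>

// Placeholder implementations; real ones would be scalar vs. SIMD.
static void *memcpy_scalar(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *memcpy_sse2(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

// The resolver runs on the first call only, then patches the pointer,
// so every later call goes straight to the chosen variant.
static void *memcpy_resolve(void *d, const void *s, size_t n);
static void *(*memcpy_best)(void *, const void *, size_t) = memcpy_resolve;

static void *memcpy_resolve(void *d, const void *s, size_t n)
{
#if defined(__GNUC__)
    memcpy_best = __builtin_cpu_supports("sse2") ? memcpy_sse2 : memcpy_scalar;
#else
    memcpy_best = memcpy_scalar;  // no CPU detection available here
#endif
    return memcpy_best(d, s, n);
}
```

Callers just use `memcpy_best(dst, src, n)`; only the first call pays for the CPU check. The "more devious" direct-call patching the update mentions would instead rewrite the call site itself, which needs writable code pages.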

Update 3

A question asked by Eshan reminded me of something useful and related to this, although only for bit sets and bit ops: BitMagic can be quite useful for large bit sets, and it even has a nice article on SSE2 (bit) optimization. Unfortunately, this still isn't a CRT/stdlib-esque type of library. It seems most of these projects are dedicated to a specific, small section (of problems).

This raises the question: would it then be worthwhile to create an open-source, probably multi-platform, performance CRT/stdlib project, creating various versions of the standardised functions, each optimized for a certain situation, as well as a 'best-case'/general-use variant of each function, with either runtime branching for scalar/MMX/SSE/SSE2+ (à la MSVC) or a forced compile-time scalar/SIMD switch?

This could be useful for HPC, or for projects where every bit of performance counts (like games), freeing the programmer from worrying about the speed of the built-in functions and requiring only a small bit of tuning to find the optimal variant.

Update 4

I think the nature of this question should be expanded to include techniques that can be applied using SSE/MMX to optimize non-vector/matrix applications; these could probably be used for 32/64-bit scalar code as well. A good example is checking for the occurrence of a byte in a given 32/64/128/256-bit data type, all at once, using scalar techniques (bit manipulation), MMX, and SSE/SIMD.
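The scalar (bit-manipulation) side of that byte-occurrence example is the well-known SWAR "has zero byte" trick; a sketch for 64-bit words:

```c
#include <stdint.h>

// True if any byte of x is zero: the classic exact "haszero" formula.
// A borrow out of a byte's low bits can only set that byte's top bit
// if the byte was zero, given the & ~x mask.
static int has_zero_byte(uint64_t x)
{
    return ((x - 0x0101010101010101ULL) & ~x & 0x8080808080808080ULL) != 0;
}

// True if any byte of x equals b: XOR with b replicated into every
// lane turns matching bytes into zero bytes, then reuse the test above.
static int has_byte(uint64_t x, unsigned char b)
{
    uint64_t mask = 0x0101010101010101ULL * b;  // broadcast b to all 8 bytes
    return has_zero_byte(x ^ mask);
}
```

The 128-bit SSE analogue of `has_byte` is a one-liner along the lines of `_mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_set1_epi8(b))) != 0`, which additionally tells you *where* the byte occurred via the mask bits.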

Also, I see a lot of answers along the lines of "just use ICC", and while that's a good answer, it's not my kind of answer. Firstly, ICC is not something I can use continuously (unless Intel has a free student version for Windows), due to the 30-day trial. Secondly, and more pertinently, I'm not only after the libraries themselves but also the techniques used to optimize/create the functions they contain, for my personal edification and improvement, so I can apply such techniques and principles to my own code (where needed), in conjunction with the use of these libraries. Hopefully that clears up that part :)


Here's an article on how to use SIMD instructions to vectorize the counting of characters:

http://porg.es/blog/ridiculous-utf-8-character-counting
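The core observation in that article is that a code point starts at every byte that is *not* a UTF-8 continuation byte (of the form 10xxxxxx); the scalar form of the test it vectorizes looks roughly like this (assumes valid UTF-8 input):

```c
#include <stddef.h>

// Count UTF-8 code points by counting non-continuation bytes.
// Continuation bytes have top bits 10xxxxxx, i.e. (b & 0xC0) == 0x80.
static size_t utf8_count(const char *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += ((unsigned char)s[i] & 0xC0) != 0x80;
    return count;
}
```

The SIMD version applies an equivalent per-byte comparison to 16 bytes at a time and accumulates the match counts, which is where the speedup comes from.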


For simple operations such as memset, memcpy, etc, where there is very little computation, there is little point in SIMD optimisation, since memory bandwidth will usually be the limiting factor.
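One qualifier: for buffers much larger than the cache, SIMD can still help indirectly through non-temporal stores, which bypass the cache and skip the read-for-ownership traffic a normal store incurs. A sketch of a streaming memset (SSE2, illustrative only; the function name is made up):

```c
#include <emmintrin.h>  // SSE2: _mm_stream_si128, _mm_set1_epi8; _mm_sfence
#include <stdint.h>
#include <string.h>

// memset-like fill using non-temporal (streaming) stores.
// For buffers larger than cache this can raise effective bandwidth,
// even though the computation itself is trivial.
static void memset_stream(void *dst, char val, size_t n)
{
    char *p = dst;
    // Byte loop for the unaligned head (streaming stores need 16-byte alignment).
    while (n && ((uintptr_t)p & 15)) { *p++ = val; n--; }
    __m128i v = _mm_set1_epi8(val);
    for (; n >= 16; n -= 16, p += 16)
        _mm_stream_si128((__m128i *)p, v);
    _mm_sfence();       // make the weakly-ordered streaming stores visible
    memset(p, val, n);  // tail
}
```

The flip side is that streaming stores evict nothing into cache, so they are a loss if the data is read again soon, which is one reason a one-size-fits-all memset is hard.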


Maybe libSIMDx86?

http://simdx86.sourceforge.net


You can use Apple's or OpenSolaris's libc. These libc implementations contain what you are looking for. I was looking for this kind of thing some 6 years back, and I had to painfully write it the hard way.

Ages ago, I remember following a coding contest called the 'FastCode' project. They did some awesome, groundbreaking optimisation for that time using Delphi. See their results page. Since it is written in Pascal's fast function-call model (copying arguments to registers), converting to C-styled stdc function-call models (pushing on the stack) may be a bit awkward. This project has had no updates in a long time; in particular, no code is written for SSE4.2.

Solaris -> src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libc/

Apple -> www.opensource.apple.com/source/Libc/Libc-594.9.1/


Here's a fast memcpy implementation in C that can replace the standard library version of memcpy if necessary:

http://www.danielvik.com/2010/02/fast-memcpy-in-c.html


Honestly, what I would do is just install the Intel C++ Compiler and learn the various automated SIMD optimization flags available. We've had very good experience optimizing code performance by simply compiling it with ICC.

Keep in mind that the entire STL library is basically just header files, so the whole thing is compiled into your exe/lib/dll, and as such can be optimized however you like.

ICC has many options and lets you specify (at the simplest) which SSE levels to target. You can also use it to generate a binary file with multiple code paths, such that if the optimal SSE configuration you compiled against isn't available, it'll run a different set of (still optimized) code configured for a less capable SIMD CPU.


strstr is hard to optimize because (a) \0-termination means you have to read every byte anyway, and (b) it has to be good on all the edge cases, too.

With that said, you can beat the standard strstr by a factor of 10 using SSE2 ops. I've noticed that gcc 4.4 uses these ops for strlen now, but not for the other string ops. More on how to use SSE2 registers for strlen, strchr, strpbrk, etc. at mischasan.wordpress.com. Pardon my super-terse code layout.

#include <emmintrin.h>  // SSE2 intrinsics
#include <stdint.h>     // uint16_t, uintptr_t
#include <string.h>     // memcmp
#include <strings.h>    // ffs (POSIX)

static inline unsigned under(unsigned x)
    { return (x - 1) & ~x; }
static inline __m128i xmfill(char b)
    { return _mm_set1_epi8(b); }
static inline __m128i xmload(void const*p)
    { return _mm_load_si128((__m128i const*)p); }
static inline unsigned xmatch(__m128i a, __m128i b)
    { return _mm_movemask_epi8(_mm_cmpeq_epi8(a, b)); }

char const *sse_strstr(char const *tgt, char const *pat)
{
    unsigned    len = sse_strlen(pat);
    if (len == 0) return tgt;
    if (len == 1) return sse_strchr(tgt,*pat);
    __m128i     x, zero = _mm_setzero_si128();
    __m128i     p0 = _mm_set1_epi8(pat[0]), p1 = _mm_set1_epi8(pat[1]);
    uint16_t    pair = *(uint16_t const*)pat;
    unsigned    z, m, f = 15 & (uintptr_t)tgt;
    char const* p;

    // Initial unaligned chunk of tgt:
    if (f) {
        z = xmatch(x = xmload(tgt - f), zero) >> f;
        m = under(z) & ((xmatch(x,p0) & (xmatch(x,p1) >> 1)) >> f);
        for (; m; m &= m - 1)
             if (!memcmp((p = tgt+ffs(m)-1)+2, pat+2, len-2))
                return p;
        if (z)
            return NULL;
        tgt += 16 - f;
        if (*(uint16_t const*)(tgt - 1) == pair
                && !memcmp(tgt+1, pat+2, len-2))
            return tgt - 1;
    }

    // 16-byte aligned chunks of tgt:
    while (!(z = xmatch(x = xmload(tgt), zero))) {
        m = xmatch(x,p0) & (xmatch(x,p1) >> 1);
        for (; m; m &= m - 1)
             if (!memcmp((p = tgt+ffs(m)-1)+2, pat+2, len-2))
                return p;
        tgt += 16;
        if (*(uint16_t const*)(tgt - 1) == pair && !memcmp(tgt+1, pat+2, len-2))
            return tgt - 1;
    }

    // Final 0..15 bytes of tgt:
    m = under(z) & xmatch(x,p0) & (xmatch(x,p1) >> 1);
    for (; m; m &= m - 1)
        if (!memcmp((p = tgt+ffs(m)-1)+2, pat+2, len-2))
            return p;

    return NULL;
}
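The code above calls sse_strlen and sse_strchr, which aren't shown. A minimal sse_strlen in the same style might look like this (a sketch: like the strstr above, it relies on aligned 16-byte loads never crossing a page boundary, so reading slightly past the terminator is safe; `__builtin_ctz` is GCC/Clang-specific):

```c
#include <emmintrin.h>  // SSE2
#include <stddef.h>
#include <stdint.h>

static size_t sse_strlen(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    uintptr_t f = (uintptr_t)s & 15;
    const char *p = s - f;  // round down to 16-byte alignment

    // First aligned block: shift away match bits for bytes before s.
    unsigned m = (unsigned)_mm_movemask_epi8(
        _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero)) >> f;
    if (m)
        return __builtin_ctz(m);  // NUL found within the first block

    // Remaining aligned 16-byte blocks.
    for (;;) {
        p += 16;
        m = (unsigned)_mm_movemask_epi8(
            _mm_cmpeq_epi8(_mm_load_si128((const __m128i *)p), zero));
        if (m)
            return (size_t)(p - s) + __builtin_ctz(m);
    }
}
```

sse_strchr follows the same pattern, testing both a `_mm_set1_epi8(c)` match and the zero match per block.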


I personally wouldn't bother trying to write super-optimized versions of libc functions trying to handle every possible scenario with good performance.

Instead, write optimized versions for specific situations, where you know enough about the problem at hand to write proper code... and where it matters. There's a semantic difference between memset and ClearLargeBufferCacheWriteThrough.
