How do I use SSE(1,2,3,4) optimizations?_问答_开发者

I'm wondering if simply compiling my msvc project with sse/sse2 will have any effect. I do for example vector normalization and dot product but I do these wth math, not any specific function. Is there like sse_dot() and sse_normalize() that I shoul开发者_如何学Cd be using to actualyy take advantage, or will the compiler know?

Thanks

As I understand it, using the sse2 compiler option will result in the compiler using the scalar not vector sse2 instructions in place of normal fpu code. I don't think it will do any vectorisation. The sse2 scalar stuff is quicker than fpu for sure.

To use the vector unit you need to use either intrinsics directly ( xmmintrin.h ) or use 3rd party libs that do. If you're just doing simple vector/matrix stuff for rendering, the Bullet SDK has an sse optimised vector math lib that's not bad. IIRC the DirectX/XNAmath lib is sse optimised too.

If neither of those take your fancy, Google should turn up a number of alternatives.

Or you can avoid having to write SSE stuff explicitly by using a high performance library like Eigen, BLAS, Intel MKL, ... Unless you are working on embedded system, these libraries will be much better than whatever you can come up with.

Either write SSE code yourself (asm or intrinsics), use third party SSE-optimised libraries (e.g. IPP, MKL, etc), or use an auto-vectorizing compiler such as Intel's ICC.

Not all compilers are as smart as you thought. Even gcc might not always optimize the most obvious code. See the following example and try by yourself. Icc seems to be able to optimize the inner loop, but gcc, as I tried several settings, cannot. When necessary, you have to manually call SSE/SSE2 instructions by using SSE functions. People told me this is a good tutorial.

EDIT: The following example works with Mac/Linux gcc. But it fails icc on linux. I do not know why. BTW, icc is believed to be better than gcc on vectorization.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <emmintrin.h>

float **mm_init(int n)
{
    float **m;
    int i;
    m = (float**)malloc(n * sizeof(void*));
    for (i = 0; i < n; ++i)
        m[i] = calloc(n, sizeof(float));
    return m;
}
void mm_destroy(int n, float **m)
{
    int i;
    for (i = 0; i < n; ++i) free(m[i]);
    free(m);
}
float **mm_gen(int n)
{
    float **m;
    int i, j;
    m = mm_init(n);
    for (i = 0; i < n; ++i)
        for (j = 0; j < n; ++j)
            m[i][j] = 2 * drand48() - 1.0;
    return m;
}
// better cache performance by transposing the second matrix
float **mm_mul2(int n, float *const *a, float *const *b)
{
    int i, j, k;
    float **m, **c;
    m = mm_init(n); c = mm_init(n);
    for (i = 0; i < n; ++i) // transpose
        for (j = 0; j < n; ++j)
            c[i][j] = b[j][i];
    for (i = 0; i < n; ++i) {
        float *p = a[i], *q = m[i];
        for (j = 0; j < n; ++j) {
            float t = 0.0, *r = c[j];
            for (k = 0; k < n; ++k)
                t += p[k] * r[k];
            q[j] = t;
        }
    }
    mm_destroy(n, c);
    return m;
}
// explicit SSE optimization for the inner loop
float **mm_mul3(int n, float *const *a, float *const *b)
{
    int i, j, k;
    float **m, **c, x[4];
    m = mm_init(n); c = mm_init(n);
    for (i = 0; i < n; ++i) // transpose
        for (j = 0; j < n; ++j)
            c[i][j] = b[j][i];
    for (i = 0; i < n; ++i) {
        float *p = a[i], *q = m[i];
        for (j = 0; j < n; ++j) {
            __m128 t = _mm_setzero_ps();
            float *r = c[j];
            for (k = 0; k < n; k += 4) // four operations in one CPU cycle
                t = _mm_add_ps(t, _mm_mul_ps(_mm_load_ps(p+k), _mm_load_ps(r+k)));
            _mm_store_ps(x, t);
            q[j] = x[0] + x[1] + x[2] + x[3];
        }
    }
    mm_destroy(n, c);
    return m;
}

int main(int argc, char *argv[])
{
    int n = 100;
    float **a, **b, **m;
    clock_t t;
    if (argc > 1) n = atoi(argv[1]);
    n = (n + 3) / 4 * 4; // for simplicity, n can be divided by 4
    srand48(11);
    a = mm_gen(n); b = mm_gen(n);

    t = clock();
    m = mm_mul2(n, a, b);
    fprintf(stderr, "cache:  %lf sec; M[%d][%d]=%f\n", (double)(clock() - t) / CLOCKS_PER_SEC, n/2, n/2, m[n/2][n/2]);

    t = clock();
    m = mm_mul3(n, a, b);
    fprintf(stderr, "SSE:    %lf sec; M[%d][%d]=%f\n", (double)(clock() - t) / CLOCKS_PER_SEC, n/2, n/2, m[n/2][n/2]);

    mm_destroy(n, a); mm_destroy(n, b); mm_destroy(n, m);
    return 0;
}

If you enable SSE2, then the compiler will use it behind your back. You will never notice nor need to know, unless you intend on supporting CPUs without SSE2. This is the same as any other underlying CPU instruction.