I have the code:
float *mu_x_ptr;
__m128 *tmp;
__m128 *mm_mu_x;
mu_x_ptr = _aligned_malloc(4*sizeof(float), 16);
mm_mu_x = (__m128*) mu_x_ptr;
for(row = 0; row < ker_size; row++) {
tmp = (__m128*) &original[row*width + col];
*mm_mu_x = _mm_add_ps(*tmp, *mm_mu_x);
}
From this I get:
First-chance exception at 0x00ad192e in SSIM.exe: 0xC0000005: Access violation reading location 0x00000000.
Unhandled exception at 0x00ad192e in SSIM.exe: 0xC0000005: Access violation reading location 0x00000000.
The program '[4452] SSIM.exe: Native' has exited with code -1073741819 (0xc0000005)
This happens when running the program; the error occurs at the _mm_add_ps line.
original is also allocated using _aligned_malloc(..., 16); and passed to the function, so as far as my understanding of SSE goes, the problem shouldn't be that it's not aligned.
Can anyone see why this crashes? I can't.
EDIT: width and col are always multiples of 4. col is 0 or 4, while width is always a multiple of 4.
EDIT2: Looks like my original array is not aligned. Wouldn't:
function(float *original);
.
.
.
original = _aligned_malloc(width*height*sizeof(float), 16);
function(original);
_aligned_free(original);
}
make sure that original is aligned inside of function?
Edit3: This is actually really weird. When I do:
float *orig;
orig = _aligned_malloc(width*height*sizeof(float), 16);
assert(isAligned(orig));
The assert fails with
#define isAligned(p) (((unsigned long)(p)) & 15 == 0)
I think you need to use
__m128 tmp = _mm_load_ps( &original[row * width + col] );
instead of
tmp = (__m128 *)&original[row * width + col];
EDIT: If the access violations only appear after some offset, then possibly your stride is not aligned. Either way, allocate __m128 elements (each of which represents 4 floats). This takes care of the alignment.
Also, you can get some extra performance by eliminating the index arithmetic [row * width + col]: determine your stride and increment your pointer accordingly.
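A minimal sketch of the pointer-increment idea, assuming original is 16-byte aligned and that width and col are multiples of 4. The name sum_rows is hypothetical, and accumulating into a local __m128 that starts at zero is an addition that is not in the question's code. Note that _mm_load_ps still requires a 16-byte-aligned address; _mm_loadu_ps is the variant that accepts unaligned ones.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_add_ps, ... */

/* Hypothetical helper: sums ker_size rows of a 4-float-wide strip that
   starts at column col. Assumes original is 16-byte aligned and that
   width and col are multiples of 4, so every load below is aligned. */
static __m128 sum_rows(const float *original, int width, int col, int ker_size)
{
    __m128 acc = _mm_setzero_ps();      /* start the sum at zero */
    const float *p = original + col;    /* first row of the strip */
    int row;

    for (row = 0; row < ker_size; row++) {
        acc = _mm_add_ps(acc, _mm_load_ps(p));  /* aligned 4-float load */
        p += width;                             /* step one row (the stride) */
    }
    return acc;
}
```

The result can be written back with _mm_store_ps (aligned destination) or _mm_storeu_ps (any destination). Keeping the accumulator in a local variable also avoids the separate _aligned_malloc for a 4-float buffer.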
tmp will be misaligned unless width and col have suitable values. Ideally both width and col should be multiples of 4.
You might want to add some asserts to check the alignment, e.g.
#define IsAligned(p) ((((unsigned long)(p)) & 15) == 0)
float *mu_x_ptr;
__m128 *tmp;
__m128 *mm_mu_x;
assert(original != NULL && IsAligned(original));
mu_x_ptr = _aligned_malloc(4 * sizeof(float), 16);
assert(mu_x_ptr != NULL && IsAligned(mu_x_ptr));
mm_mu_x = (__m128 *)mu_x_ptr;
assert(IsAligned(mm_mu_x));
for (row = 0; row < ker_size; row++)
{
tmp = (__m128 *)&original[row * width + col];
assert(IsAligned(tmp));
*mm_mu_x = _mm_add_ps(*tmp, *mm_mu_x);
}
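The inner parentheses in an alignment macro matter because == binds more tightly than & in C. Without them, p & 15 == 0 is read as p & (15 == 0), which is p & 0 and therefore always false. The sketch below spells out both readings; the macro names are hypothetical, and uintptr_t from stdint.h is swapped in for unsigned long as the integer type meant for holding pointer values.

```c
#include <assert.h>
#include <stdint.h>

/* How the compiler reads an unparenthesized "p & 15 == 0":
   the comparison is done first and yields 0, so the result is always 0. */
#define IS_ALIGNED_AS_PARSED(p) (((uintptr_t)(p)) & (15 == 0))

/* Intended check: mask the low four bits first, then compare with 0. */
#define IS_ALIGNED_16(p) ((((uintptr_t)(p)) & 15) == 0)
```

With this, assert(IS_ALIGNED_16(ptr)) only fires for pointers that really are off a 16-byte boundary, whereas the first form makes every assert fail regardless of the pointer.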