I need to copy the contents of a byte array representing an image in RGB byte order into another RGBA (4 bytes per pixel) buffer. The alpha channel will get filled later. What would be the fastest way of achieving this?
How tricky do you want it? You could set it up to copy a 4-byte word at a time, which might be a bit faster on some 32-bit systems:
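#include <stdint.h>

/* count is the number of pixels and is assumed to be non-negative. */
void fast_unpack(char* rgba, const char* rgb, const int count) {
    if(count==0)
        return;
    /* Copy 4 bytes per pixel for all but the last pixel; the 4th byte
       (the next pixel's red value) lands in the alpha slot for now. */
    for(int i=count; --i; rgba+=4, rgb+=3) {
        *(uint32_t*)(void*)rgba = *(const uint32_t*)(const void*)rgb;
    }
    /* Last pixel: copy only 3 bytes so we never read past the end of rgb. */
    for(int j=0; j<3; ++j) {
        rgba[j] = rgb[j];
    }
}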
The extra case at the end handles the last pixel: a 4-byte read there would run one byte past the end of the rgb array, so its three bytes are copied individually. You could also make it a bit faster using aligned moves and SSE instructions, working in multiples of 4 pixels at a time. If you're feeling really ambitious, you can try even more horribly obfuscated things, like prefetching a cache line into the FP registers and then blitting it across to the other image all at once. Of course, the mileage you get out of these optimizations is going to be highly dependent on the specific system configuration you are targeting, and I would be really skeptical that there is much benefit at all to doing any of this instead of the simple thing.
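One caveat with the pointer casts: reading a uint32_t through a char pointer at an address that is not 4-byte aligned is technically undefined behavior in C, even though it tends to work on x86. Here is a minimal sketch of the same word-at-a-time idea using fixed-size memcpy calls, which optimizing compilers typically turn into a single load and store. The name unpack_words and the unsigned char / size_t signature are my own choices for illustration, not part of the code above.

```c
#include <stddef.h>
#include <string.h>

/* Expand `count` RGB pixels (3 bytes each) into RGBA slots (4 bytes each).
   For every pixel except the last, the alpha slot temporarily receives the
   next pixel's red byte; the last pixel's alpha slot is left untouched. */
void unpack_words(unsigned char *rgba, const unsigned char *rgb, size_t count)
{
    if (count == 0)
        return;
    /* All pixels but the last: copy 4 bytes at once (3 wanted + 1 extra). */
    for (size_t i = 0; i + 1 < count; ++i) {
        memcpy(rgba + 4 * i, rgb + 3 * i, 4);
    }
    /* Last pixel: copy exactly 3 bytes so we never read past the source. */
    memcpy(rgba + 4 * (count - 1), rgb + 3 * (count - 1), 3);
}
```

Whether this matches the speed of the cast version depends on the compiler and flags, so it is worth measuring before relying on it.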
And my simple experiments confirm that this is indeed a little bit faster, at least on my x86 machine. Here is a benchmark:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>

void fast_unpack(char* rgba, const char* rgb, const int count) {
    if(count==0)
        return;
    for(int i=count; --i; rgba+=4, rgb+=3) {
        *(uint32_t*)(void*)rgba = *(const uint32_t*)(const void*)rgb;
    }
    for(int j=0; j<3; ++j) {
        rgba[j] = rgb[j];
    }
}

void simple_unpack(char* rgba, const char* rgb, const int count) {
    for(int i=0; i<count; ++i) {
        for(int j=0; j<3; ++j) {
            rgba[j] = rgb[j];
        }
        rgba += 4;
        rgb += 3;
    }
}

int main() {
    const int count = 512*512;
    const int N = 10000;

    char* src = (char*)malloc(count * 3);
    char* dst = (char*)malloc(count * 4);

    clock_t c0, c1;
    double t;
    printf("Image size = %d bytes\n", count);
    printf("Number of iterations = %d\n", N);

    printf("Testing simple unpack....");
    c0 = clock();
    for(int i=0; i<N; ++i) {
        simple_unpack(dst, src, count);
    }
    c1 = clock();
    printf("Done\n");
    t = (double)(c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Elapsed time: %lf\nAverage time: %lf\n", t, t/N);

    printf("Testing tricky unpack....");
    c0 = clock();
    for(int i=0; i<N; ++i) {
        fast_unpack(dst, src, count);
    }
    c1 = clock();
    printf("Done\n");
    t = (double)(c1 - c0) / (double)CLOCKS_PER_SEC;
    printf("Elapsed time: %lf\nAverage time: %lf\n", t, t/N);

    return 0;
}
And here are the results (compiled with g++ -O3):
Image size = 262144 bytes
Number of iterations = 10000
Testing simple unpack....Done
Elapsed time: 3.830000
Average time: 0.000383
Testing tricky unpack....Done
Elapsed time: 2.390000
Average time: 0.000239
So the tricky version takes maybe about 40% less time (2.39 s versus 3.83 s, roughly 1.6x the throughput) on a good day.
The fastest way would be to use a library that implements the conversion for you rather than writing it yourself. Which platform(s) are you targeting?
If you insist on writing it yourself for some reason, write a simple and correct version first. Use that. If the performance is inadequate, then you can think about optimizing it. In general, this sort of conversion is best done using vector permutes, but the exact optimal sequence varies depending on the target architecture.
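To make "vector permutes" concrete, here is a rough sketch for one architecture only: x86 with SSSE3, using the byte shuffle intrinsic _mm_shuffle_epi8 to spread 4 packed RGB pixels into 4 RGBA slots per iteration. The function name rgb_to_rgba_permute, the RGB_HAVE_SSSE3 macro, and the GCC/Clang target attribute are my assumptions for illustration; on other compilers or architectures it falls back to the plain loop. Note that it writes 0 into the alpha byte, which is fine only because alpha is filled later.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <tmmintrin.h>              /* SSSE3 intrinsics: _mm_shuffle_epi8 */
#define RGB_HAVE_SSSE3 1            /* hypothetical macro, local to this sketch */
#endif

#ifdef RGB_HAVE_SSSE3
__attribute__((target("ssse3")))    /* GCC/Clang: enable SSSE3 for this function */
#endif
void rgb_to_rgba_permute(uint8_t *dst, const uint8_t *src, size_t num)
{
    size_t i = 0;
#ifdef RGB_HAVE_SSSE3
    /* Output byte k takes source byte mask[k]; -128 (high bit set) yields 0. */
    const __m128i mask = _mm_setr_epi8(0, 1, 2, -128, 3, 4, 5, -128,
                                       6, 7, 8, -128, 9, 10, 11, -128);
    /* Each load touches src[3*i .. 3*i+15] but uses only 12 bytes, so we
       need 3*i + 16 <= 3*num, which is equivalent to i + 6 <= num. */
    for (; i + 6 <= num; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + 3 * i));
        _mm_storeu_si128((__m128i *)(dst + 4 * i), _mm_shuffle_epi8(v, mask));
    }
#endif
    /* Scalar tail (and the whole job on non-x86 targets). */
    for (; i < num; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 0;         /* match the vector path's zeroed alpha */
    }
}
```

The best sequence on other targets (NEON, AVX2, and so on) will look different, so treat this as one data point rather than a recommendation.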
#include <stddef.h>

struct rgb {
    char r;
    char g;
    char b;
};

struct rgba {
    char r;
    char g;
    char b;
    char a;
};

/* Copies num pixels; the alpha member of each dst pixel is left untouched. */
void convert(struct rgba * dst, const struct rgb * src, size_t num)
{
    size_t i;
    for (i=0; i<num; i++) {
        dst[i].r = src[i].r;
        dst[i].g = src[i].g;
        dst[i].b = src[i].b;
    }
}
This would be the cleaner solution, but as you mention an array of bytes, you should use this:
// num is still the size in pixels. So dst should have space for 4*num bytes,
// while src is supposed to be of length 3*num.
void convert(char * dst, const char * src, size_t num)
{
    size_t i;
    for (i=0; i<num; i++) {
        dst[4*i] = src[3*i];
        dst[4*i+1] = src[3*i+1];
        dst[4*i+2] = src[3*i+2];
    }
}
I think I remember a NeHe tutorial doing something like that, but fast.
It's here
The interesting part is here:
void flipIt(void* buffer)               // Flips The Red And Blue Bytes (256x256)
{
    void* b = buffer;                   // Pointer To The Buffer
    __asm                               // Assembler Code To Follow
    {
        mov ecx, 256*256                // Set Up A Counter (Dimensions Of Memory Block)
        mov ebx, b                      // Points ebx To Our Data (b)
        label:                          // Label Used For Looping
            mov al,[ebx+0]              // Loads Value At ebx Into al
            mov ah,[ebx+2]              // Loads Value At ebx+2 Into ah
            mov [ebx+2],al              // Stores Value In al At ebx+2
            mov [ebx+0],ah              // Stores Value In ah At ebx
            add ebx,3                   // Moves Through The Data By 3 Bytes
            dec ecx                     // Decreases Our Loop Counter
            jnz label                   // If Not Zero Jump Back To Label
    }
}
What it does is pretty self-explanatory (it swaps the red and blue bytes of each 3-byte pixel in place), and it should be easy to adapt this so that it also adds the alpha byte.
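For readers who don't want inline assembly, here is a hedged C sketch of what such an adaptation might look like: the same red/blue swap as flipIt, but written into a separate 4-bytes-per-pixel destination instead of in place. The name flip_and_expand and its signature are hypothetical, not taken from the tutorial.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `num` 3-byte pixels into 4-byte pixels, swapping the first and third
   bytes (red and blue) along the way. The alpha byte (dst[4*i+3]) is left
   untouched so it can be filled in later. */
void flip_and_expand(uint8_t *dst, const uint8_t *src, size_t num)
{
    for (size_t i = 0; i < num; ++i) {
        dst[4 * i + 0] = src[3 * i + 2];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 0];
    }
}
```

If no swap is wanted, the indices on the right-hand side simply become 0, 1, 2.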
Just create an array 4/3 the size of the source array. Read through the entire source array and write it to the RGBA array, inserting 255 for alpha after every 3 bytes.
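A minimal sketch of that approach, assuming the source holds num tightly packed RGB pixels; the function name rgb_to_rgba_opaque is made up for this example, and the caller is responsible for freeing the result:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Returns a newly allocated buffer of 4*num bytes (4/3 of the 3*num source
   bytes) with alpha set to 255, or NULL on overflow or allocation failure. */
uint8_t *rgb_to_rgba_opaque(const uint8_t *src, size_t num)
{
    if (num > SIZE_MAX / 4)
        return NULL;
    uint8_t *dst = (uint8_t *)malloc(num * 4);
    if (dst == NULL)
        return NULL;
    for (size_t i = 0; i < num; ++i) {
        dst[4 * i + 0] = src[3 * i + 0];
        dst[4 * i + 1] = src[3 * i + 1];
        dst[4 * i + 2] = src[3 * i + 2];
        dst[4 * i + 3] = 255;       /* fully opaque */
    }
    return dst;
}
```

Writing the alpha byte in the same pass avoids a second trip over the buffer, though the question says alpha is filled later, in which case that line can be dropped.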