I'm developing image processing software in C++ on Intel hardware that has to run a bicubic interpolation algorithm on small (about 1 kpx) images over and over again. This takes a lot of time, and I'd like to speed it up. So far I have three versions: a basic implementation based on the literature; a somewhat faster version that avoids matrix multiplication and instead uses pre-calculated formulas for parts of the interpolating polynomial; and a fixed-point version of the matrix-multiplying code (which is actually slower). I also have an external library with an optimized implementation, but it's still too slow for my needs. What I was considering next is:
- vectorization using MMX/SSE stream processing, on both the floating and fixed-point versions
- doing the interpolation in the Fourier domain using convolution
- shifting the work onto a GPU using OpenCL or similar
Which of these approaches could yield greatest performance gains? Could you suggest another? Thanks.
I think GPU is the way to go. It's probably the most natural task for this type of hardware. I would start by looking into CUDA or OpenCL. Older techniques like simple DirectX/OpenGL pixel/fragment shaders should work just fine as well.
Some links I found, maybe they could help you:
- Efficient GPU-Based Texture Interpolation using Uniform B-Splines
- CUDA Cubic B-Spline Interpolation (CI)
- Fast Third-Order Texture Filtering
There are the Intel IPP libraries, which use SIMD internally for faster processing. Intel IPP also uses OpenMP; if it is configured, you get relatively easy multiprocessing as well.
These libraries support bicubic interpolation and are commercial (you buy a development license, but the redistributables are free).
Be careful with going the GPU route. If your convolution kernel is too fast, you're going to end up being IO bound. You won't know for sure which is the fastest unless you implement both.
GPU Gems 2 has a chapter on Fast Third-Order Texture Filtering which should be a good starting point for your GPU solution.
A combination of Intel Threading Building Blocks and SSE instructions would make a decent CPU solution.
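To make the SSE half of that suggestion concrete, here is a minimal sketch of a single bicubic sample where each 4-pixel row of the 4x4 neighbourhood is handled with one SSE multiply-add. It assumes a Catmull-Rom kernel (the question does not say which bicubic kernel is used), a row-major `float` image, and no border handling; the names `cubic_weights` and `bicubic_sse` are hypothetical. Threading is left out: with TBB you would typically split the loop over output pixels (or over images) across tasks.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)
#include <cassert>
#include <cmath>
#include <vector>

// Catmull-Rom weights for the four taps around a fractional offset t in [0, 1).
// Assumption: this is one common bicubic kernel, not necessarily the asker's.
static inline void cubic_weights(float t, float w[4]) {
    const float t2 = t * t, t3 = t2 * t;
    w[0] = 0.5f * (-t3 + 2.0f * t2 - t);
    w[1] = 0.5f * (3.0f * t3 - 5.0f * t2 + 2.0f);
    w[2] = 0.5f * (-3.0f * t3 + 4.0f * t2 + t);
    w[3] = 0.5f * (t3 - t2);
}

// Samples a row-major float image at (x, y).
// The 4x4 neighbourhood must lie fully inside the image (no border handling).
float bicubic_sse(const float* img, int width, int height, float x, float y) {
    const int ix = static_cast<int>(std::floor(x));
    const int iy = static_cast<int>(std::floor(y));
    assert(ix >= 1 && ix + 2 < width && iy >= 1 && iy + 2 < height);

    float wx[4], wy[4];
    cubic_weights(x - ix, wx);
    cubic_weights(y - iy, wy);

    const __m128 vwx = _mm_loadu_ps(wx);  // horizontal weights, one per lane
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j < 4; ++j) {
        // Four horizontally adjacent pixels of row (iy - 1 + j).
        const float* row = img + (iy - 1 + j) * width + (ix - 1);
        const __m128 px = _mm_loadu_ps(row);
        // Weight by wx per lane, then by the row's vertical weight.
        const __m128 weighted = _mm_mul_ps(_mm_mul_ps(px, vwx), _mm_set1_ps(wy[j]));
        acc = _mm_add_ps(acc, weighted);
    }

    // Horizontal sum of the four lanes.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

If many samples share the same fractional offsets (for example, resizing by a fixed ratio), the weights can be computed once per row or column instead of once per sample, which is where most of the remaining scalar work goes.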
Not an answer for bicubic, but maybe an alternative:
If I understand you correctly, you have 32 x 32 xy coordinates and a 1024 x 768 image, and want the interpolated `image[xy]`. Just rounding xy, `image[ int( xy )]`, would be too grainy. But you could build a smoothed double-size image (2k x 1.5k) once and take `image2[ int( 2*xy )]`: less grainy, and very fast. Or, similarly, take `image4[ int( 4*xy )]` in a smoothed 4k x 3k image.
How well this works depends on ...
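A minimal sketch of that idea, under stated assumptions: the "smoothed" enlarged image is built here with plain bilinear upsampling (the answer does not say which smoothing to use, and a one-time bicubic upsample would work the same way), coordinates are non-negative, and the names `Image`, `upsample` and `lookup` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Image {
    int width = 0, height = 0;
    std::vector<float> data;  // row-major pixels
    float at(int x, int y) const {
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// One-time cost: enlarge by an integer factor using bilinear interpolation.
// Assumption: bilinear stands in for whatever smoothing the application needs.
Image upsample(const Image& src, int factor) {
    assert(factor >= 1 && src.width > 0 && src.height > 0);
    Image dst;
    dst.width = src.width * factor;
    dst.height = src.height * factor;
    dst.data.resize(static_cast<std::size_t>(dst.width) * dst.height);
    for (int y = 0; y < dst.height; ++y) {
        const float sy = static_cast<float>(y) / factor;
        const int y0 = static_cast<int>(sy);
        const int y1 = std::min(y0 + 1, src.height - 1);  // clamp at the border
        const float fy = sy - y0;
        for (int x = 0; x < dst.width; ++x) {
            const float sx = static_cast<float>(x) / factor;
            const int x0 = static_cast<int>(sx);
            const int x1 = std::min(x0 + 1, src.width - 1);
            const float fx = sx - x0;
            const float top = src.at(x0, y0) * (1 - fx) + src.at(x1, y0) * fx;
            const float bot = src.at(x0, y1) * (1 - fx) + src.at(x1, y1) * fx;
            dst.data[static_cast<std::size_t>(y) * dst.width + x] =
                top * (1 - fy) + bot * fy;
        }
    }
    return dst;
}

// Per-sample cost: scale, truncate, and one load. x and y are in source
// pixel units and must be non-negative.
float lookup(const Image& up, int factor, float x, float y) {
    const int ux = std::min(static_cast<int>(x * factor), up.width - 1);
    const int uy = std::min(static_cast<int>(y * factor), up.height - 1);
    return up.at(ux, uy);
}
```

The trade-off is memory and the one-time upsampling cost against per-sample work: a factor of 4 needs 16 times the storage, so this only pays off when the same source image is sampled many times.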