I'm looking at the FFT example in the CUDA SDK and I'm wondering: why is cuFFT so much faster when half of the padded data size is a power of two? (Half, because in the frequency domain half of the data is redundant.)
What's the point of having a power-of-two size to work on?
I think this is your answer: it's using different algorithms.
http://forums.nvidia.com/index.php?showtopic=195094
"I have been working on a similar problem. In the cuFFT manual, it is explained that cuFFT uses two different algorithms for implementing the FFTs. One is the Cooley-Tukey method and the other is the Bluestein algorithm. When the dimensions have prime factors of only 2, 3, 5 and 7, e.g. 675 = 3^3 x 5^2, then a 675 x 675 transform performs much, much better than, say, 674 x 674 or 677 x 677. This is done using the Cooley-Tukey method. If one of the prime factors is a prime other than 2, 3, 5 or 7, then the FFT for that number is implemented using the Bluestein method. The Bluestein method is slower and there is also some precision loss."
From the manual: http://developer.download.nvidia.com/compute/cuda/3_1/toolkit/docs/CUFFT_Library_3.1.pdf
The CUFFT library implements several FFT algorithms, each having different performance and accuracy. The best performance paths correspond to transform sizes that meet two criteria:
- Fit in CUDA's shared memory
- Are powers of a single factor (for example, powers of two)
These transforms are also the most accurate due to the numeric stability of the chosen FFT algorithm. For transform sizes that meet the first criterion but not the second, CUFFT uses a more general mixed-radix FFT algorithm that is usually slower and less numerically accurate. Therefore, if possible it is best to use sizes that are powers of two or four, or powers of other small primes (such as three, five, or seven). In addition, the power-of-two FFT algorithm in CUFFT makes maximum use of shared memory by blocking sub-transforms for signals that do not meet the first criterion.
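Putting the two quotes together: sizes whose only prime factors are 2, 3, 5, or 7 take the fast Cooley-Tukey path, anything else falls back to Bluestein. A minimal sketch of that size check (the helper name `is_good_cufft_size` is my own, not a cuFFT API):

```python
def is_good_cufft_size(n):
    """Return True if n has no prime factors other than 2, 3, 5, or 7.

    Such "7-smooth" sizes are the ones the quoted forum post and manual
    say hit cuFFT's fast Cooley-Tukey path; other sizes fall back to the
    slower Bluestein algorithm. (Hypothetical helper, not part of cuFFT.)
    """
    if n < 1:
        return False
    for p in (2, 3, 5, 7):
        while n % p == 0:
            n //= p
    return n == 1

# The sizes from the forum post: 675 = 3^3 * 5^2 is good,
# while 674 = 2 * 337 and the prime 677 are not.
print(is_good_cufft_size(675), is_good_cufft_size(674), is_good_cufft_size(677))
```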
Just to add a little more background to Ade's answer:
In general, a discrete Fourier transform takes a lot of computation: a one-dimensional DFT of N points needs on the order of N*N multiplications. The FFT (Fast Fourier Transform) is faster only because, when N is a power of 2, the equations can be rewritten so that you need only on the order of N * log2(N) multiplications.
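To make that rewriting concrete, here is a minimal pure-Python sketch of the radix-2 Cooley-Tukey recursion (an illustration of the idea, not cuFFT's implementation): a power-of-two length splits cleanly into two half-size transforms at every level, which is exactly where the N * log2(N) cost comes from.

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    Each call does n/2 twiddle multiplications and recurses on two
    half-size problems, giving ~(n/2) * log2(n) multiplications total,
    versus n*n for the direct DFT sum.
    """
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])  # transform of the even-indexed samples
    odd = fft(x[1::2])   # transform of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

# A constant signal transforms to a single DC bin:
print(fft([1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]))  # ~[4, 0, 0, 0]
```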
In most applications you don't care about the exact number of samples, so you choose a power of two to get the best performance.
Powers of three or five would also work, but powers of two are the fastest and give the easiest algorithm to write, so they have become dominant over the years.
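In practice this means padding the signal up to the next power of two before transforming, as the SDK example does. A trivial sketch of that padding-size choice (helper name `next_pow2` is my own):

```python
def next_pow2(n):
    """Smallest power of two >= n -- a common padding target before an FFT."""
    p = 1
    while p < n:
        p *= 2
    return p

# e.g. a 675-sample signal would be zero-padded to 1024 samples
print(next_pow2(675))  # 1024
```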