Nothing like std::lower_bound
is going to scale well using SSE. The reason SSE makes things faster is that it allows you to do several calculations at once. For example, a single SSE instruction might result in 4 multiply operations going on at once. However, the way std::lower_bound
operates cannot be parallelized, because each step in the algorithm requires the comparison results of previous steps. Plus, it's already O(lg n), and as a result unlikely to be a bottleneck.
Moreover, before moving to inline assembly, you should know that whenever you use inline assembly, you defeat most compiler optimizations that might occur on that section of your program, and often as a result your program will be slower -- compilers usually write better assembler than us humans do.
If you want to use SSE, you are better off using intrinsics -- special "functions" or keywords provided by the compiler, which call the SSE instruction but otherwise allow optimizations to occur. Such intrinsics are available in Microsoft's Visual C++, as well as the GNU Compiler Collection. (And probably most any compiler. Consult your compiler's documentation)
Rather than trying to speed up std::lower_bound
using SSE, you should try to not need to call it in the first place. For example, if you're constantly inserting elements into the vector using lower_bound
, you should know what you've effectively created is insertion sort, and a poor insertion sort at that, which will require quadradic time. You would likely be better off simply putting your new elements on the end of the vector, and then sorting the vector when you need it sorted, which reduces things to an O(n lg n) sort. If your data access patterns are such that you would be resorting too often then you should use something like a std::set
instead, which provides O(lg n) operations for insertions, rather than the O(n + lg n) insertions you're currently getting with the vectors.
And of course, remember to benchmark :)
精彩评论