开发者

How to unroll a short loop in C++

开发者 https://www.devze.com 2022-12-21 19:35 出处:网络
I wonder how to get something like this: Write copy(a, b, 2, 3) And then get a[2] = b[2]; a[3] = b[3]; a[4] = b[4];

I wonder how to get something like this:

  1. Write

    copy(a, b, 2, 3)
    
  2. And then get

    a[2] = b[2];
    a[3] = b[3];
    a[4] = b[4];
    
    开发者_开发知识库

I know that C #defines can't be used recursively to get that effect. But I'm using C++, so I suppose that template meta-programming might be appropriate.

I know there is a Boost library for that, but I only want that "simple" trick, and Boost is too "messy".


The most straightforward solution to this is to write a loop where the start and end values are known:

for(int i = 2; i <= 4; i++) {
  a[i]=b[i]; 
}

I think this is better than any sort of template/runtime-call mixture: The loop as written is completely clear to the compilers' optimizer, and there are no levels of function calls to dig through just to see what's going on.


C++ meta-programming is recursive.

Think of your problem in terms of recursion. Implement terminal case and non-terminal cases.

Your terminal case can be either 0 or one. Pass limits as template parameters. Use a structure/class because they allow partial specialization and other neat things:

template<int from, int to>
struct copy {
    static void apply(source, destination) {
        source[from] = destination[from];
        copy<from+1,to>:: apply(source, destination);
    }
};

// Terminal case
template<int from>
struct copy<from, from> {
    static void apply(source, destination) {
        source[from] = destination[from];
    }
};


You can probably do something like the following. Depending on your compiler and the optimziation settings that you use you may get the effect that you are looking for.

Be aware that for small objects like char it may well be slower than a std::copy or a memcpy and that for larger objects the cost of a loop is likely to be insignificant compared to the copies going on in any case.

#include <cstddef>

template<std::size_t base, std::size_t count, class T, class U>
struct copy_helper
{
    static void copy(T dst, U src)
    {
        dst[base] = src[base];
        copy_helper<base + 1, count - 1, T, U>::copy(dst, src);
    }
};

template<std::size_t base, class T, class U>
struct copy_helper<base, 0, T, U>
{
    static void copy(T, U)
    {
    }
};

template<std::size_t base, std::size_t count, class T, class U>
void copy(T dst, U src)
{
    copy_helper<base, count, T, U>::copy(dst, src);
}

template void copy<5, 9, char*, const char*>(char*, const char*);

#include <iostream>
#include <ostream>

int main()
{
    const char test2[] = "     , World\n";
    char test[14] = "Hello";

    copy<5, 9>(test, test2);

    std::cout << test;

    return 0;
}


It's important to realize that the compiler is very smart, and that tricking it to unroll loops using template metaprogramming will probably set you back further that it gets you forward.

To get the bottom out of your optimizations: keep an eye on the disassembly. This will hopefully teach you more than throwing templates at the problem.

And note, like Johannes said: if the compiler can see that you are running a loop for a fixed number of times (or a fixed multiple of times like 4x variable), it can create code very close to optimal.


Loop unrolloing using meta-programming can be used to create constexpr (I have not measured times). I have an example where it can be used to make the function combination(n, k) a cosntexpr:

template <size_t N> struct iterate_forward {
    template <class operation> static auto eval(const operation & op)
    {
        iterate_forward<N-1>::eval(op);
        return op(N-1);
    }
};
template <> struct iterate_forward<1>
{
    template <class operation> static auto eval(const operation & op)
    { return op(0); }
};
template <> struct iterate_forward<0>
{
    template <class operation> static auto eval(const operation & op) {}
};
template <size_t N, size_t K> constexpr size_t COMB()
{
    struct ratio
    {
        size_t operator () (size_t i) const { return (res *= (N-i) / (i+1)); }
        mutable size_t res = 1;
    };
    return iterate_forward<K>::eval(ratio());
} 


It doesn't use templates and it's not a "complete" unrolling, but you can partially unroll the loop with something like this:

void copy (SomeType* a, SomeType* b, int start_index, int num_items) {
    int i = start_index;

    while (num_items > 4) {
            a[i+0] = b[i+0];
            a[i+1] = b[i+1];
            a[i+2] = b[i+2];
            a[i+3] = b[i+3];
            i += 4;
            num_items -= 4;
    }
    while (num_items > 0) {
            a[i] = b[i];
            ++i;
            --num_items;
    }
}

Now in this particular example, the extra computations involved will probably outweigh the benefits from only unrolling four elements at a time. You should get an increasing benefit from an increasing number of elements inside the top loop (throughout the function, replace 4 with however many elements you are copying inside each manually-unrolled iteration).


template <int begin, int end> struct copy_;

template <int end> struct copy_<end, end> {
  template <typename T> static void execute(T& a, T& b)
  {
    a[end] = b[end];
  }
};

template <int begin, int end> struct copy_<begin, end> {
  template <typename T> static void execute(T& a, T& b)
  {
    a[begin] = b[begin];
    copy_<begin+1, end>::execute(a, b);
  }
};

template <int begin, int how_many> struct copy {
  template <typename T> static void execute(T& a, T& b)
  {
    copy_<begin, begin+how_many-1>::execute(a, b);
  }
};

copy<2, 3>::execute(a, b);


From http://www.sgi.com/tech/stl/Vector.html:

template <class InputIterator>
vector(InputIterator, InputIterator)

Creates a vector with a copy of a range. 


#define tl template
#define tn typename
#define st struct
#define cl class
#define us using


    template<tn A> st Typ { using type = A; };
    tl<tn T> using GetT = tn T::type;
    tl<tn F, tn ...As> us apply = tn F::tl apply<As...>;
    tl<tn, tn, tn ...> st LoopElements;
    tl<tn, tn> st Loop;
    tl<tn, tn, tn> st VLoop_i;
    tl<tn Sq, tn MF> us VLoop = VLoop_i<GetT<Sq>, Sq, MF>;

    //
    // TY
    //
    template<tn T> struct Ty {
        template<T ...> struct Pack : Typ<T> {};
        tl<tn ...> st Concat_i; tl<tn ...P> us Concat = GetT<Concat_i<P...>>;
        tl<T, int64_t> st Seq_i; tl<T f, T t> us Seq = GetT<Seq_i<f, ((int64_t)(t - f))>>; tl<int64_t, tn> st add;

        template<tl<T ...> tn C, T ...vs, tl<T ...> tn D, T ...ws, tn ...R> st Concat_i<C<vs...>, D<ws...>, R...> : Typ<Concat<C<vs..., ws...>, R...> >{};
        template<tl<T ...> tn C, T ...vs> st Concat_i<C<vs...>> : Typ<C<vs...> >{};

        template<int64_t x, T ...vs> struct add<x, Pack<vs...>> : Typ<Pack<((T)(vs + x))...>> {};
        template<T f, int64_t c> class Seq_i {
            using A = tn Seq_i<f, c/2>::type;
            using B = tn add<c/2, A>::type;
            using C = tn Seq_i<f + c / 2 * 2, c & 1>::type;
        public:
            using type = Concat<A, B, C>;
        };
        tl<T f> st Seq_i<f, 0> : Typ<Pack<>> {};        
        tl<T f> st Seq_i<f, 1> : Typ<Pack<f>> {};
        tl<T f> st Seq_i<f, -1> : Typ<Pack<f>> {};
    };

    //
    // LOOP
    template<size_t i, tn F, tn T> struct LoopElement { LoopElement() { apply<F, T>(); }; };
    template<size_t ...is, tn F, tn ...P> struct LoopElements<Ty<size_t>::Pack<is...>, F, P...> : LoopElement<is, F, P>... {};
    template<tn F, tl<tn ...> cl C, tn ...P> struct Loop<F, C<P...>> : LoopElements<Ty<size_t>::Seq<0, sizeof...(P)>, F, P...> {};

    template<tn T, tl<T ...> cl ST, T ...vs, tn MF> struct VLoop_i<T, ST<vs...>, MF> : LoopElements<Ty<size_t>::Seq<0, sizeof...(vs)>, MF, Val<T, vs>...> {};
0

精彩评论

暂无评论...
验证码 换一张
取 消