How to improve performance without going parallel for my backprop ANN_问答_开发者

After profiling my Back propagation algorithm, I have learnt it is responsible for taking up 60% of my computation time. Before I start looking at parallel alternatives, I would like to see if there is anything further I can do.

The activate(const double input[]) function is profiled to only take ~5% of the time. The gradient(const double input) function is implemented as follows:

inline double gradient(const double input) { return (1 - (input * input)); }

The training function in question:

void train(const vector<double>& data, const vector<double>& desired, const double learn_rate, const double momentum) {
   开发者_开发百科     this->activate(data);
        this->calculate_error(desired);

        // adjust weights for layers
        const auto n_layers = this->config.size();
        const auto adjustment = (1 - momentum) * learn_rate;

        for (size_t i = 1; i < n_layers; ++i) {
            const auto& inputs = i - 1 > 0 ? this->outputs[i - 1] : data;
            const auto n_inputs = this->config[i - 1];
            const auto n_neurons = this->config[i];

            for (auto j = 0; j < n_neurons; ++j) {
                const auto adjusted_error = adjustment * this->errors[i][j];

                for (auto k = 0; k < n_inputs; ++k) {
                    const auto delta = adjusted_error * inputs[k] + (momentum * this->deltas[i][j][k]);

                    this->deltas[i][j][k] = delta;
                    this->weights[i][j][k] += delta;
                }

                const auto delta = adjusted_error * this->bias + (momentum * this->deltas[i][j][n_inputs]);

                this->deltas[i][j][n_inputs] = delta;
                this->weights[i][j][n_inputs] += delta;
            }
        }
    }
}

This question may be better suited to https://codereview.stackexchange.com/.

You can't avoid an O(n^2) algorithm if you want to train/use a NN. But it is perfectly suited for vector arithmetic. For example with clever use of SSE or AVX you could process the neurons in chunks of 4 or 8 and use a multiply-add instead of two separate instructions.

If you use a modern compiler and carefully reformulate the algorithm and use the right switches, you might even get the compiler to autovectorize the loops for you, but your mileage may vary.

For gcc, autovectorization is activated using -O3 or -ftree-vectorize. You need an vector capable cpu of course, something like -march=core2 -mssse4.1 or similar, depending on the target cpu. If you use -ftree-vectorizer-verbose=2 you get detailed explanations, why and where loops were not vectorized. Have a look at http://gcc.gnu.org/projects/tree-ssa/vectorization.html .

Better is of course using the compiler intrinsics directly.

You want to eliminate the conditional from inside your loop here:

const double lower_layer_output = i > 0 ? outputs[lower_layer][k] : input[k]; // input layer semantics

You can eliminate this condition by calculting the zero'th iteration (the special case of i==0) earlier.

        deltas[i][j][k] = delta;
        weights[i][j][k] += delta;

You mention using std::vector, so this is a vector of vector of vector? Your data is not going to be contiguous (except in the sense that each vector is contigous). Consider using C style arrays.

How big are those dimensions? There may be some caching considerations if very large. E.g. you don't want that last subscript [k] to flush the L1 cache. Sometimes breaking the loop to process a smaller range of k indexes at a time can help (strip mining).

You can also experiment with unrolling your inner loops a little, e.g. try doing 4 or eight operations inside the loop. Increment by 4/8 respectively and handle any remainder in another loop. The compiler may be doing that already.

As others have mentioned using SIMD (SSE/AVX) is probably where you can find the most gain. You can either use compiler intrinsics (link is to Visual Studio but gcc has support with same syntax) or write in assembly (inlined or otherwise). As you mentioned, scaling across multiple cores is another direction. OpenMP can help you do that without a lot of pain.

Sometimes it is useful to generate an annotated assembly listing from your code to try and see where the compiler isn't doing such a great jobs.

This is an excellent general resource about optimization.

I'm not sure if the compiler can optimize it in your case but getting out inverse_momentum * (learn_rate * errors[i][j]) to a variable outside to loop "k" in the lower loops might decrease the load on the CPU.

BTW you are profiling a release binary and not a debug one aren't you.

I'm not fond of valarray, but I have a hunch that there is quite some opportunity for those around here.

Blitz++ (boost) seems to have a better aura around the web, but I don't know it :)

I was starting to work on a PoC myself, but there are too many missing bits of code

void activate(const double input[]) { /* ??? */ }
const unsigned int n_layers_ns;
const unsigned int n_layers;
const unsigned int output_layer_s;
const unsigned int output_layer;

T/*double?*/ bias = 1/*.0f?/;

const unsigned int config[];
double outputs[][];
double errors [][];
double weights[][][];
double deltas [][][];

Now it follows logically from the code that at least the first (rank-0) indices to the arrays are defined by the 4 missing constants. If these constants can be known compile time, these would make great value class template parameters:

template <unsigned int n_layers_ns = 2,
          unsigned int n_layers = 3>
struct Backprop {
    void train(const double input[], const double desired[], const double learn_rate, const double momentum);

    void activate(const double input[]) { }
    enum _statically_known
    {
        output_layer = n_layers_ns - 1,
        output_layer_s = n_layers - 1, // output_layer with input layer semantics (for config use only)

        n_hidden_layers = output_layer - 1,
    };

    static const double bias = 1.0f;

    const unsigned int config[];
    double outputs[3][50];      // if these dimensions could be statically known, 
    double errors[3][50];       //        slap them in valarrays and 
    double weights[3][50][50];  //        see what the compiler does with that!
    double deltas[3][50][50];   // 
};

template <unsigned int n_layers_ns,
          unsigned int n_layers>
void Backprop<n_layers_ns, n_layers>::train(const double input[], const double desired[], const double learn_rate, const double momentum) {
    activate(input);

    // calculated constants
    const double inverse_momentum = 1.0 - momentum;
    const unsigned int n_outputs = config[output_layer_s];

    // calculate error for output layer
    const double *output_layer_input = output_layer > 0 ? outputs[output_layer] : input; // input layer semantics
    for (unsigned int j = 0; j < n_outputs; ++j) {
        //errors[output_layer][j] = f'(outputs[output_layer][j]) * (desired[j] - outputs[output_layer][j]);
        errors[output_layer][j] = gradient(output_layer_input[j]) * (desired[j] - output_layer_input[j]);
    }
    [... snip ...]

Notice how I reordered the statements in the first loop a bit, to make the loop trivial. Now, I can imagine those last lines becoming

// calculate error for output layer
const std::valarray<double> output_layer_input = output_layer>0? outputs[output_layer] : input; // input layer semantics
errors[output_layer] = output_layer_input.apply(&gradient) * (desired - output_layer_input);

This will require the proper (g)slices to be setup for the inputs. I cannot work out how these would have to be dimensioned. The crux of the matter is that as long as these slice dimensions can be statically determined by the compiler, you will have the potential for signifcant time savings, since the compiler can optimize these into vectorized operations on either the FPU stack or the using the SSE4 instruction set. I suppose you would declare your output something like this:

std::valarray<double> rawoutput(/*capacity?*/);
std::valarray<double> outputs = rawoutput[std::slice(0, n_outputs, n_layers)]; // guesswork

(I expect weights and deltas would have to become gslices because AFAICT they are 3-dimensional)

Miscellanea (alignment, layout)

I realized that there probably won't be much gain if the arrays ranks (dimensions) aren't optimally ordered (e.g. the first rank in a valarray is relatively small, say 8). This could hamper vectorization because the participating elements could be scattered in memory, where I suppose optimization requires them to be adjacent.

In this light it is important to realize that the 'optimal' ordering of the ranks is ultimately dependent on the access patterns alone (so, profile and inspect again).

Also, the opportunity for optimization might be hampered by unfortunate memory alignment [1]. In that light, you might want to switch the order of (val)array ranks and round rank dimensions to nearest powers of 2 (or mor practically, say multiples of 32 bytes).

If all this actually makes a big impact (profile/inspect generated code first!) I would imagine support

that Blitz++ or boost might contain helpers (allocators?) to enforce alignments
your compiler will have attributes (of the align and/or restrict kind) to tell them they can assume these alignments for input pointers

Unrelated:

If the order of execution is not crucial (i.e. the relative orders of magnitude of factors are very similar), instead of

inverse_momentum * (learn_rate * ???)

you could take

(inverse_momentum * learn_rate) * ???

and precalculate the first subproduct. However, from the fact that it is explicitely ordered this way, I'm guessing that this would lead to more noise.

[1] disclaimer: I haven't actually done any analysis on that, I'm just throwing it out there so you don't miss the 'though joint' (how's that for Engrish)