There are various activation functions: sigmoid, tanh, etc. And there are also a few initializer functions: Nguyen-Widrow, random, normalized, constant, zero, etc. So do these have much effect on the outcome of a neural network specialising in face detection? Right now I'm using the tanh activation function and just randomising all the weights from -0.5 to 0.5. I have no idea if this is the best approach though, and with 4 hours to train the network each time, I'd rather ask on here than experiment!
Take a few hundred data cases and look at the mean and standard deviation of the activation values of your units. You want to be out of the saturation regime of the tanh sigmoid.
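A rough sketch of that check, assuming a single hidden layer of tanh units with a weight matrix W and bias b (the names and shapes here are illustrative, not from your setup):

```python
import numpy as np

def hidden_activation_stats(X, W, b):
    """Report mean/std of tanh hidden activations over a batch of inputs.

    X: (n_cases, n_inputs) a few hundred data cases
    W: (n_inputs, n_hidden) input-to-hidden weights
    b: (n_hidden,) hidden biases
    """
    h = np.tanh(X @ W + b)  # hidden activations, each in (-1, 1)
    print("mean activation: %.3f" % h.mean())
    print("std  activation: %.3f" % h.std())
    # Activations piled up near -1 or +1 (mean far from 0, |h| mostly ~1)
    # suggest the units are saturated and gradients will be tiny.
    return h.mean(), h.std()
```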
I doubt different reasonable initialization schemes will have much effect on the quality of your solutions. It is probably good enough to just initialize the weights to be uniform on the interval [-1/sqrt(N), +1/sqrt(N)], where N is the number of incoming connections.
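For example, a minimal sketch of that scheme in NumPy (the layer sizes are placeholders):

```python
import numpy as np

def init_uniform_fanin(n_in, n_out, rng=np.random.default_rng()):
    """Uniform init on [-1/sqrt(N), +1/sqrt(N)], N = number of incoming connections."""
    bound = 1.0 / np.sqrt(n_in)
    return rng.uniform(-bound, bound, size=(n_in, n_out))

# e.g. 400 inputs (a 20x20 face patch) feeding 50 hidden units -- illustrative sizes
W1 = init_uniform_fanin(400, 50)
```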
That being said, what DOES tend to make a big difference is pretraining the network weights, either as an RBM or as an autoencoder. This can be helpful even for single-hidden-layer neural nets, although it is much more essential for deeper nets. You don't mention the architecture you are using; that information would allow a more helpful answer to your question.
There is even a new initialization rule, described in this paper, that seems to work well: http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/447 The paper also mentions some of the symptoms of bad initialization that I was alluding to above and that you can easily check for.
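If the linked paper is Glorot and Bengio's 2010 initialization paper, the rule it proposes (often called "normalized" or "Xavier" initialization) scales by both fan-in and fan-out; a sketch under that assumption:

```python
import numpy as np

def init_normalized(n_in, n_out, rng=np.random.default_rng()):
    """Normalized (Glorot-style) init: uniform on +/- sqrt(6/(n_in + n_out))."""
    bound = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-bound, bound, size=(n_in, n_out))
```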
To summarize, uniform on [-1/sqrt(N), +1/sqrt(N)] isn't too bad, and neither is the rule from the paper I link to. Don't worry about it too much if you use one of those. What is very important is pretraining the weights as an autoencoder (or Restricted Boltzmann Machine), which you should look into even if you only have a single hidden layer.
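A minimal sketch of autoencoder pretraining for a single tanh hidden layer, using tied weights and plain batch gradient descent on squared reconstruction error (all names and hyperparameters here are illustrative assumptions, not part of the answer above):

```python
import numpy as np

def pretrain_autoencoder(X, n_hidden, epochs=50, lr=0.1, rng=np.random.default_rng(0)):
    """Pretrain a tanh hidden layer as a tied-weight autoencoder.

    X: (n_cases, n_inputs) training inputs (e.g. face image patches)
    Returns W (n_inputs, n_hidden) and b (n_hidden,), which can then be
    used to initialize the hidden layer of the face-detection net.
    """
    n_cases, n_inputs = X.shape
    bound = 1.0 / np.sqrt(n_inputs)
    W = rng.uniform(-bound, bound, size=(n_inputs, n_hidden))
    b = np.zeros(n_hidden)   # hidden biases
    c = np.zeros(n_inputs)   # reconstruction biases

    for _ in range(epochs):
        h = np.tanh(X @ W + b)              # encode
        X_hat = h @ W.T + c                 # decode (linear reconstruction)
        d_out = (X_hat - X) / n_cases       # gradient of 0.5*MSE wrt X_hat
        d_h = (d_out @ W) * (1.0 - h ** 2)  # backprop through tanh
        grad_W = X.T @ d_h + d_out.T @ h    # encoder + decoder terms (tied weights)
        W -= lr * grad_W
        b -= lr * d_h.sum(axis=0)
        c -= lr * d_out.sum(axis=0)

    return W, b
```

After pretraining, W and b replace the randomly initialized input-to-hidden weights, and supervised training of the face detector proceeds from there.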
If you want to pre-train the weights as an RBM, you could switch to logistic sigmoids and even initialize the weights from a small-standard-deviation Gaussian without running into trouble.
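For illustration, a sketch of that Gaussian initialization together with one contrastive-divergence (CD-1) update for a binary RBM with logistic units; this assumes inputs scaled to [0, 1], and the sizes and learning rate are placeholders:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(V, W, b_vis, b_hid, lr=0.1, rng=np.random.default_rng(0)):
    """One CD-1 weight update for a binary RBM.

    V: (n_cases, n_visible) batch of inputs, assumed scaled to [0, 1].
    """
    n_cases = V.shape[0]
    p_h0 = sigmoid(V @ W + b_hid)                        # hidden probabilities
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden states
    p_v1 = sigmoid(h0 @ W.T + b_vis)                     # reconstruct visibles
    p_h1 = sigmoid(p_v1 @ W + b_hid)                     # re-infer hiddens
    W += lr * (V.T @ p_h0 - p_v1.T @ p_h1) / n_cases
    b_vis += lr * (V - p_v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# Small-standard-deviation Gaussian initialization, as mentioned above
# (sizes are illustrative):
rng = np.random.default_rng(0)
W = 0.01 * rng.standard_normal((400, 50))
b_vis, b_hid = np.zeros(400), np.zeros(50)
```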