Neural Network with softmax activation

edit:

A more pointed question: What is the derivative of softmax to be used in my gradient descent?


This is more or less a research project for a course, and my understanding of NNs is fairly limited, so please be patient :)


I am currently in the process of building a neural network that examines an input dataset and outputs the probability/likelihood of each classification (there are 5 different classifications). Naturally, the outputs of all the output nodes should sum to 1.

Currently, I have two layers, and I set the hidden layer to contain 10 nodes.

I came up with two different implementations (a minimal sketch of the first one follows the list):

  1. Logistic sigmoid for hidden layer activation, softmax for output activation
  2. Softmax for both hidden layer and output activation
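
Roughly, the forward pass of the first variant looks like the following Python/NumPy sketch (the variable names and the 13-feature input size are placeholders rather than my actual code):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softmax(z):
        # Subtract the max for numerical stability; the probabilities are unchanged.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    # Placeholder dimensions: 13 input features (processed Cleveland data),
    # 10 hidden nodes, 5 output classes.
    n_in, n_hidden, n_out = 13, 10, 5
    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))
    b2 = np.zeros(n_out)

    def forward(x):
        h = sigmoid(W1 @ x + b1)   # hidden activations, shape (10,)
        y = softmax(W2 @ h + b2)   # output probabilities, shape (5,), sums to 1
        return h, y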

I am using gradient descent to find local minima of the error in order to adjust the hidden nodes' weights and the output nodes' weights. I am certain that I have this correct for sigmoid. I am less certain about softmax (or whether I can use gradient descent at all). After a bit of research, I couldn't find an answer, so I computed the derivative myself and obtained softmax'(x) = softmax(x) - softmax(x)^2 (this returns a column vector of size n). I have also looked into the MATLAB NN toolkit: the softmax derivative it provides returns a square matrix of size n×n, whose diagonal coincides with the softmax'(x) I calculated by hand, and I am not sure how to interpret the output matrix.
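
My guess is that the toolkit's n×n matrix is the full Jacobian of softmax, diag(s) - s*s^T, whose diagonal is exactly the softmax(x) - softmax(x)^2 I derived; a quick NumPy check of that guess (the helper names are mine):

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def softmax_jacobian(z):
        # Assumed form of the n x n matrix: J[i, j] = s_i * (delta_ij - s_j).
        s = softmax(z)
        return np.diag(s) - np.outer(s, s)

    z = np.array([0.5, -1.2, 2.0, 0.1, 0.3])
    s = softmax(z)
    J = softmax_jacobian(z)

    # The hand-derived softmax(x) - softmax(x)^2 matches the Jacobian's diagonal.
    print(np.allclose(s - s**2, np.diag(J)))   # True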

I ran each implementation with a learning rate of 0.001 and 1000 iterations of back propagation. However, my NN returns 0.2 (an even distribution) for all five output nodes, for any subset of the input dataset.

My conclusions:

  • I am fairly certain that my gradient descent is implemented incorrectly, but I have no idea how to fix it.
  • Perhaps I am not using enough hidden nodes
  • Perhaps I should increase the number of layers

Any help would be greatly appreciated!

The dataset I am working with can be found here (processed Cleveland): http://archive.ics.uci.edu/ml/datasets/Heart+Disease


The gradient you use is actually the same as with squared error: output - target. This might seem surprising at first, but the trick is that a different error function is minimized:

-\sum_{n=1}^{N} \sum_{k=1}^{K} t_{kn} \log(y_{kn})

where log is the natural logarithm, N denotes the number of training examples, and K the number of classes (and thus the number of units in the output layer). t_kn is the binary coding (0 or 1) of the k-th class in the n-th training example, and y_kn is the corresponding network output.

Showing that the gradient is correct might be a good exercise, I haven't done it myself, though.
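
A quick numerical sketch of that claim in NumPy (the helper names are arbitrary): the gradient of the error above with respect to the softmax pre-activations comes out to output - target.

    import numpy as np

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def cross_entropy(y, t):
        return -np.sum(t * np.log(y))

    # One training example with 5 classes; class 2 is the true one.
    z = np.array([0.2, -0.4, 1.1, 0.0, 0.7])   # pre-activations of the output layer
    t = np.array([0.0, 0.0, 1.0, 0.0, 0.0])    # binary coding t_k

    # Numerical gradient of the error with respect to the pre-activations z.
    eps = 1e-6
    num_grad = np.zeros_like(z)
    for i in range(len(z)):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        num_grad[i] = (cross_entropy(softmax(zp), t) - cross_entropy(softmax(zm), t)) / (2 * eps)

    # True: the gradient equals output - target.
    print(np.allclose(num_grad, softmax(z) - t, atol=1e-6))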

To your problem: You can check whether your gradient is correct by numerical differentiation. Say you have a function f and an implementation of f and f'. Then the following should hold:

f'(x) = \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon} + O(\epsilon^2)
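
In code, such a check could look like this (a sketch; f and analytic_grad are placeholders for your error function and your hand-coded gradient):

    import numpy as np

    def numerical_grad(f, x, eps=1e-5):
        # Central differences: (f(x + eps) - f(x - eps)) / (2 * eps), one coordinate at a time.
        g = np.zeros_like(x)
        for i in range(x.size):
            xp, xm = x.copy(), x.copy()
            xp.flat[i] += eps
            xm.flat[i] -= eps
            g.flat[i] = (f(xp) - f(xm)) / (2 * eps)
        return g

    def check_gradient(f, analytic_grad, x, tol=1e-5):
        # True when the analytic gradient agrees with the numerical estimate.
        return np.allclose(analytic_grad(x), numerical_grad(f, x), atol=tol)

    # Example: f(x) = sum(x^2) has gradient 2x.
    x = np.random.default_rng(1).normal(size=7)
    print(check_gradient(lambda v: np.sum(v ** 2), lambda v: 2 * v, x))   # True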


Please look at sites.google.com/site/gatmkorn for the open-source Desire simulation program. In the Windows version, the /mydesire/neural folder has several softmax classifiers, some with a softmax-specific gradient-descent algorithm.

In the examples, this works nicely for a simple character-recognition task.

See also

Korn, G.A.: Advanced Dynamic-System Simulation, Wiley 2007

GAK


Look at this link: http://www.youtube.com/watch?v=UOt3M5IuD5s. The softmax derivative is dy_i/dz_i = y_i * (1.0 - y_i) (this is the diagonal, i = j, entry of the full Jacobian).
