I'm trying to train a single layer of an autoencoder using minFunc, and while the cost function appears to decrease, the DerivativeCheck fails when enabled. The code I'm using is as close to textbook values as possible, though extremely simplified.
The loss function I'm using is the squared-error:
$ J(W; x) = \frac{1}{2}||a^{l} - x||^2 $
with $a^{l} = \sigma(W^{T}x)$, where $\sigma$ is the sigmoid function. The gradient should therefore be:
$ \delta = (a^{l} - x) \circ a^{l} \circ (1 - a^{l}) $
$ \nabla_{W} = \delta(a^{l-1})^T $
(where $\circ$ denotes the element-wise product).
Note that to simplify things, I've left off the bias altogether. While this will hurt performance, it shouldn't affect the gradient check, since I'm only looking at the weight matrix. Additionally, I've tied the encoder and decoder matrices, so there is effectively a single weight matrix.
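For reference, minFunc's DerivativeCheck compares this analytic gradient against a numerical one; the standard central-difference construction (which my own checkDeriv further down implements) is:
$ \frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon} $
where $e_i$ is the $i$-th standard basis vector.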
The code I'm using for the loss function is (edit: I've vectorized the loop I had and cleaned code up a little):
% loss function passed to minFunc
function [ loss, grad ] = calcLoss(theta, X, nHidden)
[nInstances, nVars] = size(X);
% we get the variables as a single vector, so we need to roll it back into a weight matrix
W = reshape(theta(1:nVars*nHidden), nVars, nHidden);
Wp = W; % tied weight matrix
% encode each example (rows of X): nInstances x nHidden
hidden = sigmoid(X*W);
% decode each example: nInstances x nVars
output = sigmoid(hidden*Wp');
% loss function: 0.5*sum((x - output).^2)
% derivative of loss: -(x - output)*f'(o)
% if f is sigmoid, then f'(o) = output.*(1-output)
diff = X - output;
error = -diff .* output .* (1 - output);
dW = error'*hidden; % nVars x nHidden, matching W
loss = 0.5*sum(diff(:).^2) / nInstances;
% need to unroll the gradient matrix back into a single vector
grad = dW(:) ./ nInstances;
end
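(The code assumes a sigmoid helper is on the path; base MATLAB doesn't ship one under that name, so a minimal version would be:)
% sigmoid.m: minimal element-wise logistic function
function y = sigmoid(x)
y = 1 ./ (1 + exp(-x));
end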
Below is the code I use to run the optimizer (just once, as the runtime is fairly long with all the training samples):
examples = 5000;
fprintf('loading data..\n');
images = readMNIST('train-images-idx3-ubyte', examples) / 255.0;
data = images(:, :, 1:examples);
% each row is a different training sample
X = reshape(data, 28*28, examples)'; % transpose so each row is one unrolled image
% initialize weight matrix with random values
% W: (R^{784} -> R^{10}), W': (R^{10} -> R^{784})
numHidden = 10; % NOTE: this is extremely small to speed up DerivativeCheck
numVisible = 784;
low = -4*sqrt(6./(numHidden + numVisible));
high = 4*sqrt(6./(numHidden + numVisible));
W = low + (high-low)*rand(numVisible, numHidden);
% run optimization
options = [];
options.Display = 'iter';
options.GradObj = 'on';
options.MaxIter = 10;
options.MaxFunEvals = ceil(options.MaxIter * 2.5);
options.DerivativeCheck = 'on';
options.Method = 'lbfgs';
[ x, f, exitFlag, output] = minFunc(@calcLoss, W(:), options, X, numHidden);
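As an aside (this isn't in my original script), a quick way to exercise the loss/gradient pair without waiting on MNIST is tiny random data:
% hypothetical smoke test on tiny random data
rng(0);
Xs = rand(20, 6); % 20 samples, 6 "pixels", one sample per row
nh = 3;
theta0 = 0.1*randn(6*nh, 1);
[loss0, grad0] = calcLoss(theta0, Xs, nh);
fprintf('loss = %g, ||grad|| = %g\n', loss0, norm(grad0));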
The differences I get between the two gradients with the DerivativeCheck on are generally greater than 0.1. I've tried similar code using batch gradient descent, and get slightly better results (some are < 0.0001, but certainly not all).
I'm not sure if I made either a mistake with my math or code. Any help would be greatly appreciated!
Update
I discovered a small typo in my code (which doesn't appear in the code below) causing exceptionally bad performance. Unfortunately, I'm still getting less-than-good results. For example, a comparison between the two gradients:
calculated   check
  0.0379    0.0383
  0.0413    0.0409
  0.0339    0.0342
  0.0281    0.0282
  0.0322    0.0320
with differences of up to 0.04, which I'm assuming is still failing.
Okay, I think I might have solved the problem. Generally the differences in the gradients are < 1e-4, though I do have at least one which is 6e-4. Does anyone know if this is still acceptable?
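(One commonly cited rule of thumb, not something minFunc itself enforces, is to compare relative rather than absolute error; with analytic gradient ga and numerical gradient gn as column vectors:)
% relative error between analytic (ga) and numerical (gn) gradients
relErr = abs(ga - gn) ./ max(max(abs(ga), abs(gn)), 1e-8);
% values around 1e-4 or smaller are usually taken as a pass for sigmoid nets
fprintf('max relative error: %g\n', max(relErr));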
To get this result, I rewrote the code without tying the weight matrices (I'm not sure if tying them will always cause the derivative check to fail). I've also included biases, as they didn't complicate things too badly.
Something else I realized when debugging is that it's really easy to make a mistake in the code. For example, it took me a while to catch:
grad_W1 = error_h*X';
instead of:
grad_W1 = X*error_h';
While the difference between these two lines is just the transpose of grad_W1, because the parameters have to be packed/unpacked into a single vector, there's no way for MATLAB to complain about grad_W1 having the wrong dimensions.
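One cheap guard against exactly this class of bug (my addition for illustration, not part of the original code) is to assert shapes before packing:
% hypothetical guard: catch transposed gradients before they get packed away
assert(isequal(size(grad_W1), size(W1)), 'grad_W1 has wrong shape');
assert(isequal(size(grad_W2), size(W2)), 'grad_W2 has wrong shape');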
I've also included my own derivative check, which gives slightly different answers than minFunc's (my derivative check gives differences that are all below 1e-4).
fwdprop.m:
function [ hidden, output ] = fwdprop(W1, bias1, W2, bias2, X)
% X holds one sample per column (nVisible x nInstances)
hidden = sigmoid(bsxfun(@plus, W1'*X, bias1));      % nHidden x nInstances
output = sigmoid(bsxfun(@plus, W2'*hidden, bias2)); % nVisible x nInstances
end
calcLoss.m:
function [ loss, grad ] = calcLoss(theta, X, nHidden)
[nVars, nInstances] = size(X);
[W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
[hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
err = output - X;
% backprop: delta at the output layer, then at the hidden layer
delta_o = err .* output .* (1.0 - output);
delta_h = W2*delta_o .* hidden .* (1.0 - hidden);
% gradients accumulate over all samples (columns of X)
grad_W1 = X*delta_h';
grad_bias1 = sum(delta_h, 2);
grad_W2 = hidden*delta_o';
grad_bias2 = sum(delta_o, 2);
loss = 0.5*sum(err(:).^2);
grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
end
unpackParams.m:
function [ W1, bias1, W2, bias2 ] = unpackParams(params, nVisible, nHidden)
mSize = nVisible*nHidden;
W1 = reshape(params(1:mSize), nVisible, nHidden);
offset = mSize;
bias1 = params(offset+1:offset+nHidden);
offset = offset + nHidden;
W2 = reshape(params(offset+1:offset+mSize), nHidden, nVisible);
offset = offset + mSize;
bias2 = params(offset+1:end);
end
packParams.m:
function [ params ] = packParams(W1, bias1, W2, bias2)
params = [W1(:); bias1(:); W2(:); bias2(:)];
end
checkDeriv.m:
function [check] = checkDeriv(X, theta, nHidden, epsilon)
[nVars, nInstances] = size(X);
[W1, bias1, W2, bias2] = unpackParams(theta, nVars, nHidden);
[hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
err = output - X;
delta_o = err .* output .* (1.0 - output);
delta_h = W2*delta_o .* hidden .* (1.0 - hidden);
grad_W1 = X*delta_h';
grad_bias1 = sum(delta_h, 2);
grad_W2 = hidden*delta_o';
grad_bias2 = sum(delta_o, 2);
check = zeros(size(theta, 1), 2);
grad = packParams(grad_W1, grad_bias1, grad_W2, grad_bias2);
for i = 1:size(theta, 1)
Jplus = calcHalfDeriv(X, theta(:), i, nHidden, epsilon);
Jminus = calcHalfDeriv(X, theta(:), i, nHidden, -epsilon);
calcGrad = (Jplus - Jminus)/(2*epsilon);
check(i, :) = [calcGrad grad(i)];
end
end
calcHalfDeriv.m:
function [ loss ] = calcHalfDeriv(X, theta, i, nHidden, epsilon)
theta(i) = theta(i) + epsilon;
[nVisible, nInstances] = size(X);
[W1, bias1, W2, bias2] = unpackParams(theta, nVisible, nHidden);
[hidden, output] = fwdprop(W1, bias1, W2, bias2, X);
err = output - X;
loss = 0.5*sum(err(:).^2);
end
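Usage looks something like this (illustrative sizes; epsilon = 1e-4 is a typical choice):
% illustrative driver for checkDeriv on a tiny problem
nVisible = 6; nHidden = 3;
X = rand(nVisible, 20); % one sample per column
nParams = nVisible*nHidden + nHidden + nHidden*nVisible + nVisible;
theta = 0.1*randn(nParams, 1);
check = checkDeriv(X, theta, nHidden, 1e-4);
fprintf('max abs difference: %g\n', max(abs(check(:,1) - check(:,2))));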
Update
Okay, I've also figured out why tying the weights was causing issues. I wanted to reduce the parameter vector to just [ W1; bias1; bias2 ], since W2 = W1'. This way I could simply recreate W2 by looking at W1. However, because the values of $\theta$ are perturbed by epsilon during the numerical check, this was in effect changing both matrices at the same time. The proper solution is to simply pass W1 as a separate parameter while at the same time reducing $\theta$.
Update 2
Okay, this is what I get for posting too late at night. While the first update does indeed cause things to pass correctly, it's not the correct solution.
I think the correct thing to do is to actually calculate the gradients for W1 and W2 separately, and then set the final gradient of W1 to grad_W1 + grad_W2' (the transpose makes the shapes agree). The hand-waving argument is that since the weight matrix acts to both encode and decode, its weights must be affected by both gradients. I haven't thought through the actual theoretical ramifications of this yet, however.
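A minimal sketch of what that looks like (my reconstruction, not code from the original post; it assumes the same fwdprop and sigmoid helpers as above):
% sketch: loss/gradient with tied weights, W2 = W1'
function [ loss, grad ] = calcLossTied(theta, X, nHidden)
[nVars, nInstances] = size(X);
mSize = nVars*nHidden;
W1 = reshape(theta(1:mSize), nVars, nHidden);
bias1 = theta(mSize+1:mSize+nHidden);
bias2 = theta(mSize+nHidden+1:end);
[hidden, output] = fwdprop(W1, bias1, W1', bias2, X);
err = output - X;
delta_o = err .* output .* (1.0 - output);
delta_h = (W1'*delta_o) .* hidden .* (1.0 - hidden);
grad_W1 = X*delta_h';      % encode-path gradient (nVars x nHidden)
grad_W2 = hidden*delta_o'; % decode-path gradient (nHidden x nVars)
loss = 0.5*sum(err(:).^2);
% both paths contribute to the single tied matrix
grad = [reshape(grad_W1 + grad_W2', mSize, 1); sum(delta_h, 2); sum(delta_o, 2)];
end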
If I run this using my own derivative check, it passes the 10e-4 threshold. It does much better than before with minFunc's derivative check, though still worse than if I don't tie the weights.