Can increasing the amount of training data for Precision Neural Networks lead to problems (over-fitting, for example)? Is it always good to increase the amount of training data? Will that always lead to convergence? If not, in which cases does it fail? An example would be helpful.
Thanks,
When you say "test cases" I'm going to assume you're talking about data instances.
Let's look at several scenarios:
Anisotropy
Suppose you have a training data set with 1000 instances that are all very similar to each other, but the instances in your qualification data set are significantly different from your training data. For example, suppose you are trying to estimate the function y = mx + b.
Suppose that some of the samples in your data set help you estimate m and others help you estimate b. If you give your neural network 1000 samples that help it estimate b but only 5 samples that help it estimate m, it will perform really poorly when it comes to estimating m. You will be over-fitting your neural network, and adding more samples that help you estimate b will not help at all.
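To make the imbalance concrete, here is a minimal sketch that uses ordinary least squares in place of a neural network (an assumption made for brevity; the function name `slope_standard_error` and the sample layouts are hypothetical, not from the question). For a straight-line fit with noise standard deviation sigma, the standard error of the slope estimate is sigma / sqrt(sum((x - mean(x))^2)), so samples that all sit at the same x pin down b but say almost nothing about m.

```python
import numpy as np

def slope_standard_error(x, sigma=1.0):
    """Standard error of the least-squares slope for inputs x and noise std sigma."""
    x = np.asarray(x, dtype=float)
    return sigma / np.sqrt(np.sum((x - x.mean()) ** 2))

# 5 samples spread along the line: the only ones that carry information about m.
spread = np.linspace(-1.0, 1.0, 5)

# The same 5 samples plus 1000 (then 100000) samples at x = 0,
# which help estimate b but not m.
clustered_1k = np.concatenate([spread, np.zeros(1000)])
clustered_100k = np.concatenate([spread, np.zeros(100000)])

# 1005 samples spread evenly over the same range (hypothetical "proportional" set).
balanced = np.linspace(-1.0, 1.0, 1005)

for name, x in [
    ("5 spread", spread),
    ("5 spread + 1000 at x=0", clustered_1k),
    ("5 spread + 100000 at x=0", clustered_100k),
    ("1005 spread", balanced),
]:
    print(f"{name:>26}: slope std. error = {slope_standard_error(x):.4f}")
```

In this sketch the three unbalanced sets give the same uncertainty on m no matter how many x = 0 samples are added, while the evenly spread set of the same size does much better.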
Isotropy
Now suppose that you have a proportional distribution (note that I didn't say equal) of data instances in your data set. You want them to be proportional because you might need more data instances to estimate m than you would need to estimate b. Now your data is relatively homogeneous, and adding more samples gives you more opportunities to make a better estimate of the function. With y = mx + b you can technically have an infinite number of data instances (since the line is infinite in both directions) and it will probably help, but there is a point of diminishing returns.
Diminishing Returns
With the y = mx + b example you could have an infinite number of data instances, but if you can estimate the function with 1,000 instances then adding 100,000 more data instances to your data set might not be useful. At some point adding more instances will not result in a better fit; hence the diminishing returns.
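The flattening can be seen in a small simulation. This is only a sketch: it again uses a least-squares line fit rather than a neural network, and the true values m = 2 and b = 1, the noise level, and the helper name `slope_error` are all assumptions chosen for illustration.

```python
import numpy as np

def slope_error(n, m=2.0, b=1.0, noise=1.0, seed=0):
    """Fit y = m*x + b to n noisy samples and return |m_hat - m|."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)
    y = m * x + b + rng.normal(0.0, noise, size=n)
    # np.polyfit returns coefficients from highest degree to lowest: [slope, intercept].
    m_hat, b_hat = np.polyfit(x, y, deg=1)
    return abs(m_hat - m)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"n = {n:>6}   |m_hat - m| = {slope_error(n):.4f}")
```

For this kind of fit the typical error shrinks roughly like 1/sqrt(n), so each 100-fold increase in data buys only about a 10-fold reduction in error. Whether that is worth collecting depends on the accuracy you actually need.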
Now suppose that you're trying to estimate a boolean function like XOR:
A   B   A XOR B
1   1   0
1   0   1
0   1   1
0   0   0
In this case you simply can't add more data, and it wouldn't make sense to try: there are only four valid data instances and that's ALL you have. With this example there is no point in adding more data instances at all.
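As a quick illustration (the helper name `xor_instances` is made up for this sketch), enumerating the inputs shows that any larger XOR data set can only repeat the same four rows:

```python
from itertools import product

def xor_instances():
    """Enumerate every valid (A, B, A XOR B) instance."""
    return [(a, b, a ^ b) for a, b in product((1, 0), repeat=2)]

instances = xor_instances()

# "Adding more data" can only mean repeating rows that already exist.
duplicated = instances * 250

print(len(duplicated), "rows,", len(set(duplicated)), "distinct instances")
# prints "1000 rows, 4 distinct instances"
```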
Conclusion
In general, whether adding more data instances helps depends directly on your problem: some problems might benefit from more data instances and other problems might suffer. You have to analyze your data set, and you might have to process it so that your samples are more representative of the real-world data. You have to study the problem you're trying to solve, understand its domain, understand the data samples you have, and plan accordingly. There is no one-size-fits-all solution in machine learning/artificial intelligence.
The over-fitting problem refers to building the net with too many neurons, so that during the training process the net fits the training data "too well". In other words, it's like fitting a polynomial of degree n to m data points, where n is greater than or close to m. Because the function has so many degrees of freedom, the fit will be closer, but this doesn't mean that the curve is the best one. The same thing happens with NNs: the relation between the number of neurons and the error is not steadily decreasing; it looks more like a smile.
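The polynomial analogy can be tried directly. The following is a rough sketch under assumed settings (10 noisy points drawn from y = 2x + 1, a fixed random seed, and a hypothetical helper `fit_errors`); it compares the error on the training points with the error on a dense, noise-free grid as the degree approaches the number of points.

```python
import numpy as np

def fit_errors(degree, m_points=10, noise=0.3, seed=0):
    """Return (train_mse, test_mse) for a polynomial fit to noisy y = 2x + 1."""
    rng = np.random.default_rng(seed)
    x_train = np.linspace(-1.0, 1.0, m_points)
    y_train = 2.0 * x_train + 1.0 + rng.normal(0.0, noise, size=m_points)
    # Dense, noise-free grid standing in for unseen data.
    x_test = np.linspace(-1.0, 1.0, 200)
    y_test = 2.0 * x_test + 1.0
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 9):
    train_mse, test_mse = fit_errors(degree)
    print(f"degree {degree}: train MSE = {train_mse:.2e}, test MSE = {test_mse:.2e}")
```

With degree 9 and 10 points the polynomial passes through every training point, so the training error drops to essentially zero; the error on the unseen grid is what tells you whether the curve is actually any good, and it is typically worse than for the low-degree fit.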
There is no proof that more data will lead to more error, but some works pre-analyze the data, applying principal component analysis to capture the most relevant relations.