I have a data set saved as a text file that basically contains vectors stored line by line. Each vector is 10k in dimension and I have 250 such vectors. Each vector entry is a double. Here's an example:
Vector 1 -> 0.0 0.0 0.0 0.439367 0.0 .....10k such entries
Vector 2 -> 0.0 0.0 0.0 0.439367 0.0 0.0 0.0 0.0 .....10k such entries
...
...
Vector 250 -> 0.0 1.203973 0.0 0.0 0.0 .....10k such entries
Now if I do the math, this should take up 10k x 16 bytes x 250 of space (assuming each vector entry is a double taking up 16 bytes), which is ~40MB. However, I see the file size is only 9.8MB. Am I going wrong somewhere?
The thing is, I am using this data in my Java code, and the space complexity of my algorithm is O(no. of entries in the vector x no. of entries). Even when I run my code allocating around 4GB of memory, I still run out of heap space. What am I missing?
Thanks. Andy
After so many people guessing about the size, I did 3 simple tests and used the Eclipse Memory Analyzer to determine the sizes. (Win7, 1.6.0_21, Java HotSpot(TM) 64-Bit Server VM)
double[][]                   = Size: 19.2 MB  Classes: 328  Objects: 2.7k
Double[][]                   = Size: 76.5 MB  Classes: 332  Objects: 2.5m
ArrayList<ArrayList<Double>> = Size: 79.6 MB  Classes: 330  Objects: 2.5m

256MB (java -Xmx256m Huge) was enough to run the tests.
So I guess the problem is not the size; it could be two things:
- there is a bug in the algorithm
- the JVM does not actually run with 4GB (see the quick check below)
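The second point is easy to rule out by printing the heap the JVM actually got at startup. A minimal sketch (the class name HeapCheck is made up):

// Prints the maximum heap available to this JVM.
// Class name is made up; run with e.g. java -Xmx4g HeapCheck.
public class HeapCheck {
    public static void main(String[] args) {
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + (maxBytes / (1024 * 1024)) + " MB");
    }
}

If this prints far less than 4096 MB, the -Xmx setting is not actually reaching the JVM.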
If somebody is interested in the code:
import java.util.ArrayList;
import java.util.List;

public class Huge {

    private static final int NUMBER_OF_VECTORS = 250;
    private static final int VECTOR_SIZE = 10000;

    // Runs all three allocation tests; 256MB of heap is enough.
    public static void main(String[] args) {
        doubleArray();
        doubleWrapperArray();
        list();
    }

    // Size: 19.2 MB  Classes: 328  Objects: 2.7k
    public static void doubleArray() {
        double[][] structure = new double[NUMBER_OF_VECTORS][];
        for (int i = 0; i < NUMBER_OF_VECTORS; i++) {
            structure[i] = new double[VECTOR_SIZE];
        }
    }

    // Size: 76.5 MB  Classes: 332  Objects: 2.5m
    public static void doubleWrapperArray() {
        Double[][] structure = new Double[NUMBER_OF_VECTORS][];
        for (int i = 0; i < NUMBER_OF_VECTORS; i++) {
            structure[i] = new Double[VECTOR_SIZE];
            for (int k = 0; k < VECTOR_SIZE; k++) {
                structure[i][k] = Double.valueOf(Math.random());
            }
        }
    }

    // Size: 79.6 MB  Classes: 330  Objects: 2.5m
    public static void list() {
        List<List<Double>> structure = new ArrayList<List<Double>>();
        for (int i = 0; i < NUMBER_OF_VECTORS; i++) {
            List<Double> vector = new ArrayList<Double>();
            for (int k = 0; k < VECTOR_SIZE; k++) {
                vector.add(Double.valueOf(Math.random()));
            }
            structure.add(vector);
        }
    }
}
Without seeing the code I can't say for certain, but it sounds like you're over-allocating either a) when you read the data from the file or b) somewhere in your algorithm. I would advise using a tool such as VisualVM to review your object allocation; it will tell you how you're allocating and what mistakes you're making.
Now if I do the math, this should take up 10k x 16 bytes x 250 of space (assuming each vector entry is a double taking up 16 bytes), which is ~40MB. However, I see the file size is only 9.8MB. Am I going wrong somewhere?
Where you're going wrong is the assumption that every double takes 16 bytes of space when saved as text. You seem to have lots of 0 values, which take only 4 bytes in string form ("0.0 ", including the separator).
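Just to illustrate (the 95% zero ratio and the file name are made up, not the asker's real data), a sketch that writes similarly shaped text and shows why the file stays small:

import java.io.FileWriter;
import java.io.IOException;

// Writes 250 lines of 10000 mostly-zero doubles as text.
// The 95% zero ratio and file name are made up for the demo.
public class TextSizeDemo {
    public static void main(String[] args) throws IOException {
        FileWriter out = new FileWriter("vectors.txt");
        for (int v = 0; v < 250; v++) {
            StringBuilder line = new StringBuilder();
            for (int k = 0; k < 10000; k++) {
                line.append(Math.random() < 0.05 ? Math.random() : 0.0).append(' ');
            }
            out.write(line.append('\n').toString());
        }
        out.close();
        // Each "0.0 " costs only 4 characters, so the file lands around 10-12 MB,
        // while 250 * 10000 doubles in memory need about 19 MB (8 bytes each).
    }
}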
Even when I run my code by allocating like 4GB of memory, I still run out of heap space. What am I missing?
That depends on your code. One reason might be that you're storing your data in an ArrayList<Double> or (worse) a TreeSet<Double>: the Double wrapper objects easily cause a memory overhead of 200%, and the Set/Map structures are much worse.
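If that's the case, parsing the file straight into a primitive double[][] removes the wrapper overhead entirely. A minimal sketch, assuming each line holds only whitespace-separated values (class and file names are made up; adjust the parsing if your lines carry a "Vector N ->" prefix):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class VectorLoader {
    // Reads one whitespace-separated vector per line into a primitive
    // double[][]: 8 bytes per entry instead of a boxed Double object each.
    public static double[][] load(String fileName, int vectors, int size) throws IOException {
        double[][] data = new double[vectors][size];
        BufferedReader in = new BufferedReader(new FileReader(fileName));
        for (int i = 0; i < vectors; i++) {
            String[] parts = in.readLine().trim().split("\\s+");
            for (int k = 0; k < size; k++) {
                data[i][k] = Double.parseDouble(parts[k]);
            }
        }
        in.close();
        return data;
    }
}

Usage would be double[][] data = VectorLoader.load("vectors.txt", 250, 10000); at 8 bytes per entry that is roughly the 19 MB the double[][] test above showed, instead of ~76 MB for boxed Doubles.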
Hard to say without seeing the code and VM arguments. But note that the variables in your algorithm also consume memory, and that file size vs. memory usage depends on how you construct your in-memory objects: for example, even a simple object without a double in it takes up space of its own.
Get a proper tool for benchmarking memory usage. Check out the TPTP Eclipse distribution.
Also, you might want to check out sparse matrices.
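Since most entries are 0.0, storing only the non-zero values can shrink the footprint dramatically. A minimal sketch of the idea (class and method names are made up):

import java.util.HashMap;
import java.util.Map;

// Stores only the non-zero entries of a conceptually dense vector.
// Class and method names are made up for illustration.
public class SparseVector {
    private final Map<Integer, Double> nonZero = new HashMap<Integer, Double>();
    private final int dimension;

    public SparseVector(int dimension) {
        this.dimension = dimension;
    }

    public void set(int index, double value) {
        if (value != 0.0) {
            nonZero.put(index, value);
        } else {
            nonZero.remove(index);
        }
    }

    public double get(int index) {
        Double value = nonZero.get(index);
        return value == null ? 0.0 : value.doubleValue();
    }
}

For heavy numeric work, parallel int[]/double[] index-value arrays would be leaner than the boxed map, but the map version shows the idea.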
If we can't see the code (which is fair enough), all I can say is to use the -XX:+HeapDumpOnOutOfMemoryError command line option when you start your application, then analyse the resulting heap dump with jhat.
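For example (the main class name and pid are placeholders; the actual dump file name is whatever the JVM reports, typically java_pid<pid>.hprof):

java -Xmx4g -XX:+HeapDumpOnOutOfMemoryError MyApp
jhat java_pid1234.hprof

jhat then serves the analysis as a local web page (http://localhost:7000 by default), where you can browse which objects dominate the heap.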