How to handle large data sets in Java without using too much memory

I'm working in Java. I have the requirement that I must essentially compare two database queries. To do this, I take each row of the result set and assign it to a Hashtable with the field name as the 'key' and the data in the field as the 'value'. I then group the entire result set of Hashtables into a single Vector just as a container. So essentially, to compare two queries I'm really iterating through two Vectors of Hashtables.

I've found that this approach works really well for me but requires a lot of memory. Because of other design requirements, I have to do this comparison via a Vector-Hashtable-like structure, and not some DB-side procedure.

Does anyone have any suggestions for optimization? The optimal solution would be one that is somewhat similar to what I am doing now as most of the code is already designed around it.

Thanks

Specify the same ORDER BY clause (based on the "key") for both result sets. Then you only have to have one record from each result set in memory at once.

For example, say your results are res1 and res2.

If the key field of res1 is less than the key field of res2, res2 is missing some records; iterate res1 until its key field is equal to or greater than the key of res2.

Likewise, if the key field of res1 is greater than the key field of res2, res1 is missing some records; iterate res2 instead.

If the key fields of the current records are equal, you can compare their values, then iterate both result sets.

You can see, in this manner, that only one record from each result is required to be held in memory at a given time.
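The steps above can be sketched roughly like this. It is only an illustration: the iterators stand in for two ResultSets that were both queried with the same ORDER BY, and the Integer key, String value, and the names SortedCompare and diff are assumptions made up for the example, not anything from the question.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class SortedCompare {

    /** Walks two key-sorted row streams in lockstep and reports the differences. */
    static List<String> diff(Iterator<Map.Entry<Integer, String>> res1,
                             Iterator<Map.Entry<Integer, String>> res2) {
        List<String> report = new ArrayList<>();
        // Only the current record of each side is held at any time.
        Map.Entry<Integer, String> a = res1.hasNext() ? res1.next() : null;
        Map.Entry<Integer, String> b = res2.hasNext() ? res2.next() : null;

        while (a != null || b != null) {
            int cmp;
            if (a == null) {
                cmp = 1;            // res1 is exhausted, the rest exists only in res2
            } else if (b == null) {
                cmp = -1;           // res2 is exhausted, the rest exists only in res1
            } else {
                cmp = Integer.compare(a.getKey(), b.getKey());
            }

            if (cmp < 0) {
                report.add("only in res1: " + a.getKey());
                a = res1.hasNext() ? res1.next() : null;
            } else if (cmp > 0) {
                report.add("only in res2: " + b.getKey());
                b = res2.hasNext() ? res2.next() : null;
            } else {
                // Same key on both sides: compare the values, then advance both.
                if (!Objects.equals(a.getValue(), b.getValue())) {
                    report.add("differs: " + a.getKey());
                }
                a = res1.hasNext() ? res1.next() : null;
                b = res2.hasNext() ? res2.next() : null;
            }
        }
        return report;
    }
}
```

With real ResultSets you would replace hasNext()/next() with rs.next() and read the key column from the current row. The sort order used by the database has to match the comparison used in the code, otherwise the walk reports false differences.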

Have you looked at the Flyweight Pattern? Do you have lots of equal objects?

Perhaps this pattern might be appropriate for your 'Key', as I imagine the field names are going to be repeated for each row? If they're Strings, you can call intern() so that they'll share the same memory location with other equal Strings, as Strings are immutable.
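As a rough sketch of the intern() idea for the keys: the helper below builds one row map and interns each field name on the way in, so every row shares a single String instance per field name. InternedRows and row are hypothetical names, and the plain arrays stand in for whatever you read from the ResultSet. Whether this saves anything depends on your JDBC driver, which may already hand back shared String instances for column names.

```java
import java.util.HashMap;
import java.util.Map;

final class InternedRows {

    /** Builds one row map; each field name is interned so equal names share one String. */
    static Map<String, Object> row(String[] fieldNames, Object[] values) {
        Map<String, Object> row = new HashMap<>();
        for (int i = 0; i < fieldNames.length; i++) {
            // intern() returns the canonical instance from the JVM string pool
            row.put(fieldNames[i].intern(), values[i]);
        }
        return row;
    }
}
```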

Another possible optimization - not memory but speed - if concurrency is not an issue would be to use an ArrayList rather than a Vector. ArrayList is not synchronized, so accesses should be a little faster. Similarly, HashMap isn't synchronized and Hashtable is, so using the former might be faster too.

You don't specify what kind of comparison you need, but I would reduce the amount of data held by the HashMap/Vector by transforming the row information into a single hash number.

Something like this:

import java.sql.ResultSet;
import java.sql.SQLException;

class RowHash {
    private final int id;       // the row id
    private final int hashCode; // summary of the whole row info

    public RowHash( ResultSet rs ) throws SQLException {
        this.id = rs.getInt("id");
        // concatenate the strings from all the fields and keep only the hash
        this.hashCode = new StringBuilder()
                       .append( rs.getString("field1") )
                       .append( rs.getString("field2") )
                       .append( rs.getString("fieldN") )
                       .toString().hashCode();
    }

    @Override
    public final boolean equals( Object other ) {
        // equal hashes suggest, but do not prove, equal rows (collisions are possible)
        return other instanceof RowHash
            && this.hashCode == ((RowHash) other).hashCode;
    }

    @Override
    public final int hashCode() {
        return hashCode;
    }
}

And then store it in an ArrayList instead of a Vector, since ArrayList is not synchronized.

 ... 
 ResultSet rs = ... 
 while( rs.next() ) {
     arrayList.add( new RowHash( rs ) );
 }

Well, the idea (depending on the comparison you need) is to compute a number representing the whole record, and then use that single number to see whether the other query has it.

Bear in mind that this is just a concept, you'll have to modify it to suit your needs.
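To make the "use that single number" step a bit more concrete, here is one way the comparison could look once each row is reduced to a hash. Everything here is an assumption for illustration: rows are plain String arrays laid out as {id, field1, ...}, and RowHashCompare, summarize and countMismatches are made-up names. A matching hash only suggests the rows are equal, so if collisions matter you would re-check suspicious rows against the database.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RowHashCompare {

    /** Maps each row id (element 0) to a hash of the remaining fields. */
    static Map<String, Integer> summarize(List<String[]> rows) {
        Map<String, Integer> hashes = new HashMap<>();
        for (String[] row : rows) {
            // Arrays.hashCode combines the per-field hashes in order,
            // so the full field strings are not retained.
            int hash = Arrays.hashCode(Arrays.copyOfRange(row, 1, row.length));
            hashes.put(row[0], hash);
        }
        return hashes;
    }

    /** Counts ids that are missing on one side or whose hashes differ. */
    static int countMismatches(Map<String, Integer> left, Map<String, Integer> right) {
        int mismatches = 0;
        for (Map.Entry<String, Integer> e : left.entrySet()) {
            Integer other = right.get(e.getKey());
            if (other == null || !other.equals(e.getValue())) {
                mismatches++;   // missing from right, or present with a different hash
            }
        }
        for (String id : right.keySet()) {
            if (!left.containsKey(id)) {
                mismatches++;   // present only in right
            }
        }
        return mismatches;
    }
}
```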

Another (probably simpler) way to reduce the amount of memory used by a program that holds a lot of strings is to call intern().

See this answer to compare the impact, but really it depends on your data.

Here's a before/after comparison using intern() from that answer (screenshots not preserved). The area in blue is memory used: around 2 GB before, and under 25 MB after.

If you can sort the results of both queries, you should adopt a sort-merge join algorithm.

You could encapsulate each row in your own object, for instance a 'MyRecord', which is smaller than a HashMap. The result set then becomes a List of 'MyRecord'.
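A minimal sketch of what such a class might look like. The three fields (id, name, amount) are invented for the example, so substitute the real columns of your query. Compared with a map per row, there is no hash table and no per-entry objects, and primitive columns are stored without boxing.

```java
import java.util.Objects;

final class MyRecord {
    // Hypothetical columns; replace with the actual fields of your query.
    final int id;
    final String name;
    final long amount;

    MyRecord(int id, String name, long amount) {
        this.id = id;
        this.name = name;
        this.amount = amount;
    }

    /** Two records are equal when every column matches. */
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof MyRecord)) return false;
        MyRecord r = (MyRecord) other;
        return id == r.id && amount == r.amount && Objects.equals(name, r.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(id, name, amount);
    }
}
```

Because equals() covers every field, comparing two rows from the two queries becomes a single call, and a List of MyRecord can be used much like the existing Vector of maps.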

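If you have to use HashMap, use new HashMap(7,1) instead of the default constructor. That could save memory, since you said each map holds a fixed 8 key-value pairs: a requested capacity of 7 is rounded up to 8 buckets (the default is 16), and a load factor of 1 lets all 8 entries fit without a resize.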

If you do not have the memory, you will need external storage backing your data structure. That is hard to do correctly (maps of weak references to your data, which all need to be rolled out to disk, etc.), and you will probably still end up with bad performance when scaling.

If you really have lots and lots of data, I would suggest embedding a SQL database. Then you can generate two tables containing your data and ask the database to find out any differences, and drop the tables afterwards. I've previously played with Derby, which I found nice, but others exist.

If your dataset does not fit into memory, then do an external sort, followed by the sort-merge join, as already pointed out in another answer.
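For reference, an external sort can be sketched in plain Java roughly as follows: read the input in chunks that fit in memory, sort each chunk and write it to a temporary file, then merge the sorted files with a priority queue. This is only a sketch under assumptions: each record is one line of text without embedded line breaks, natural String ordering is the sort order you want, chunkSize is at least 1, and ExternalSort and its methods are names invented for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class ExternalSort {

    /** Sorts 'lines' into 'output', buffering at most chunkSize lines at once. */
    static void sort(Iterator<String> lines, int chunkSize, Path output) throws IOException {
        List<Path> chunks = new ArrayList<>();
        List<String> buffer = new ArrayList<>(chunkSize);
        while (lines.hasNext()) {
            buffer.add(lines.next());
            if (buffer.size() == chunkSize || !lines.hasNext()) {
                // Sort the in-memory chunk and spill it to a temp file.
                Collections.sort(buffer);
                Path chunk = Files.createTempFile("chunk", ".txt");
                Files.write(chunk, buffer);
                chunks.add(chunk);
                buffer.clear();
            }
        }
        merge(chunks, output);
        for (Path chunk : chunks) {
            Files.delete(chunk);
        }
    }

    /** An open chunk reader plus its current line; ordered by that line. */
    private static final class Cursor implements Comparable<Cursor> {
        final BufferedReader reader;
        String line;

        Cursor(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        @Override
        public int compareTo(Cursor other) {
            return line.compareTo(other.line);
        }
    }

    /** K-way merge: repeatedly emit the smallest current line among all chunks. */
    private static void merge(List<Path> chunks, Path output) throws IOException {
        PriorityQueue<Cursor> heap = new PriorityQueue<>();
        for (Path chunk : chunks) {
            Cursor c = new Cursor(Files.newBufferedReader(chunk));
            if (c.line != null) heap.add(c); else c.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Cursor c = heap.poll();
                out.write(c.line);
                out.newLine();
                c.line = c.reader.readLine();
                if (c.line != null) heap.add(c); else c.reader.close();
            }
        }
    }
}
```

Once both result sets are sorted this way, the merge comparison only needs one record from each side in memory. If the database can do the ORDER BY for you, that is usually simpler than sorting in the application.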

If your dataset does fit in to memory, then just use a lot of memory - it's fastest that way.

Or if you are interested in specific optimizations just doing what you already do a little bit better - I can't help you.