开发者

What is the best List implementation for Large lists in java

开发者 https://www.devze.com 2022-12-11 15:20 出处:网络
I have to create a large list of n elements (could be up to 100,000). each element in the list is an integer equivalent to the index of the list. After this I have to call Collections.shuffle on this

I have to create a large list of n elements (could be up to 100,000). each element in the list is an integer equivalent to the index of the list. After this I have to call Collections.shuffle on this list. My question is, which list implementation (either java collections or apache collections) should be used. My gut feeling is ArrayList can well be used here. All thoughts are appreciated. Thanks!

Thanks for the inputs. I think I am sticking to the ArrayList. I am currently using the ArrayList constructor with the initialCapacity param and I pass the size of the list. So if the original list is 100000, I create this new list with new ArrayList(100000); Hence I think I 开发者_StackOverflow社区don't have the create an array and do an asList since there won't be any resizing. Also, most of the apache collections Lists like GrowthList & LazyList do not implement RandomAccess. This for sure would slow down the shuffle (as per javadocs). FastArrayList does implement RandomAccess but apache has a note for this class saying "This class is not cross-platform. Using it may cause unexpected failures on some architectures".


ArrayList most probably has the least overhead per list element, so should be the best choice. It might be a worse choice if you frequently need to delete items in the middle of the list.


Quoted from the Collections.shuffle javadoc:

This method runs in linear time. If the specified list does not implement the RandomAccess interface and is large, this implementation dumps the specified list into an array before shuffling it, and dumps the shuffled array back into the list. This avoids the quadratic behavior that would result from shuffling a "sequential access" list in place.

So if you have no other needs i would go with ArrayList which implements RandomAccess.


Making an Integer array and then wrapping it with Arrays.asList gives you even less overhead than a regular ArrayList.

List<Integer> makeList(int size){
    if (size < 0) throw new IllegalArgumentException();
    Integer[] arr = new Integer[size];
    for (int i = 0; i < arr.length; ++i) arr[i] = i;
    List<Integer> list = Arrays.asList(arr);
    Collection.shuffle(list);
    return list;
}

You save one entire int worth of space (... which admittedly is absolutely nothing in this context), but it does perform fewer range checks than the "real" ArrayList, so accessing will be slightly faster. Probably nothing you'll notice, though :)


ArrayList<T> would probably be fine, yes - but what criteria are you using for "best" anyway? And just how good does it have to be anyway? What are your trade-offs between complexity and "goodness" in whatever those criteria are?


Javolution claims to have the fastest List implementation in java. But I couldn't find any shuffle implementation in this library so you will have to do it by hand.


Google's Guava library has some really nice primitive handling, including an Ints.asList() method returning a list that may be shuffled.

The Guava project is still at a preliminary stage of deployment, though the code has been carefully reviewed and heavily used at Google. You'll need to retrieve the code from SVN and build the com.google.common.primitive classes.


This is about your update to your question concerning FastArrayList.

FastArrayList does implement RandomAccess but apache has a note for this class saying "This class is not cross-platform. Using it may cause unexpected failures on some architectures".

The FastArrayList class (javadoc) is a concurrent list class. This is what the javadoc says:

A customized implementation of java.util.ArrayList designed to operate in a multithreaded environment where the large majority of method calls are read-only, instead of structural changes. When operating in "fast" mode, read calls are non-synchronized and write calls perform the following steps:

  1. Clone the existing collection
  2. Perform the modification on the clone
  3. Replace the existing collection with the (modified) clone

[...]

NOTE: If you are creating and accessing an ArrayList only within a single thread, you should use java.util.ArrayList directly (with no synchronization), for maximum performance.

NOTE: This class is not cross-platform [due to issues with fast mode and multiple threads]

Now your use-case (as described) is single-threaded. So:

  • The "cross-platform" issue is not relevant because it only affects multi-threaded use-cases.
  • The first "NOTE" says (clearly) that for a single-threaded application you are better off using an ArrayList.

In short, the "fast" in FastArrayList is relative to (say) doing this:

  List<String> myConcurrentlList = Collections.synchronizedList(new ArrayList<>());

Back to your original question. ArrayList is the simplest of the fast ways, and I doubt that any other List class will beat it. However, the following approach may be faster.

  String[] array = new String[...];
  // populate array
  // shuffle array ... using same algorithm as Collections.shuffle
  for (int i = array.length; i > 1; i--)
      swap(array, i - 1, rnd.nextInt(i));
  }
  List<String> list = Arrays.asList(array);

Why might that be faster? Because swap operations on an array will be faster than on an ArrayList.

Will it be faster overall? Hard to say. It depends on:

  • whether you creating / populating the array like that is / isn't extra work
  • whether the performance of list operations on an asList wrapper compared to an ArrayList ... and what operations you perform, etc.

My advice would be to beware of "premature optimization".


There is a new List implementation called GlueList which is very fast than ArrayList and LinkedList.

Disclaimer: I've created the implementation.


ArrayList will be the best List for this. As the array backing will be very efficent for swapping elements used in shuffle.

But if you are really chacing performance you may wish to consider using an int[] or a custom list based in int[] as with all the List and List standard implementations you will be boxing and unboxing ints to Integers.

This will not be an issue on the suffle as this will just be reordering pointers but you will be creating 100,000 objects when you may not need to. Assuming you know the size of your list before creation you can quite easily create a new List class that wraps a primitive array. If used as a java.util.List you will still need to box the return from any get method.


You can also use memory mapped file based list implementation . In such implementation the list is not fully present in memory but only a section of the huge list will be active in memory. If you are reaching heap space limitation (mostly in 32 bit jvm) you may be needing to make the list push off data seamlessly using a memory mapped file which will be faster than normal file I/O. One such implementation is described in this google code and explained in this link.

0

精彩评论

暂无评论...
验证码 换一张
取 消