开发者

What it is the best performance way to grab duplicates from a large Set<String>?

开发者 https://www.devze.com 2023-03-20 11:59 出处:网络
I have a large Set<String> that contains many words, say: \"aaa, cCc, dDD, AAA, bbB, BBB, AaA, CCc, ...\"

I have a large Set<String> that contains many words, say:

"aaa, cCc, dDD, AAA, bbB, BBB, AaA, CCc, ..."

I want to group all duplicate words from the Set ignoring the case sensitivity of the words then save them in a Vector<Vector<String>> or whateve开发者_JAVA百科r, so each Vector<String> item will contain the group of similar words, like this :

Vector<String>: aaa, AAA, AaA, ...

Vector<String>: cCc, CCc, ...

Vector<String>: bbB, BBB, ...

I care about the performance as this Set contain many words.


If you truly care about performance you would not use Vector. As for the sorting problem one solution would be to use the TreeMap or TreeSet object and create a Comparator that does the equality (sorting) you want.

The instantiation could be:

new TreeMap<String,LinkedList<String>>(new Comparator<String>() {

   // comparator here

});

Usage:

LinkedList<String> entry = map.get(nextKey);
if (entry == null) {
  entry = new LinkedList<String>()
  map.put(nextKey, entry);
}
entry.add(nextKey);


I would create a HashMap<String, Vector<String>> hashMap. Next, for each 'string' in your set

if (!hashMap.containsKey(string.toLowerCase()){
     Vector v = new Vector();
     v.add(string);
      hashMap.put(string.toLowerCase(), v);
} else { 
     hashMap.get(string.toLowerCase()).add(string);
}

At the end, create a Vector of vectors if needed, or work with the hashmap.valueSet()


If you can choose Set implementation you can use TreeSet with Comparator that compares strings ignoring case. Then you will be able to iterate over sorted list and easily group duplicates.


This iterates over the input set once and I doubt you can get much faster than that. Swapping the ArrayLists for LinkedLists may trade locality for less copying, which may be an performance gain, but I doubt it. Here's the code:

Set<String> input = new HashSet<String>(Arrays.asList(
    "aaa", "cCc", "dDD", "AAA", "bbB", "BBB", "AaA", "CCc"));

Map<String, List<String>> tmp = new HashMap<String, List<String>>();

for (String s : input) {
    String low = s.toLowerCase();
    List<String> l = tmp.get(low);

    if (l == null) {
        l = new ArrayList<String>();
        tmp.put(low, l);
    }

    l.add(s);
}

final List<List<String>> result = new ArrayList<List<String>>(tmp.values());
0

精彩评论

暂无评论...
验证码 换一张
取 消