Java: Getting the 500 most common words in a text via HashMap

Developer https://www.devze.com 2023-02-25 21:35 Source: Internet

I'm storing each word's count in the value field of a HashMap. How can I then get the 500 most frequent words in the text?

    public ArrayList<String> topWords(int numberOfWordsToFind, ArrayList<String> theText) {

        //ArrayList<String> frequentWords = new ArrayList<String>();

        ArrayList<String> topWordsArray = new ArrayList<String>();
        HashMap<String, Integer> frequentWords = new HashMap<String, Integer>();
        int wordCounter = 0;

        for (int i = 0; i < theText.size(); i++) {
            if (frequentWords.containsKey(theText.get(i))) {
                // find value and increment
                wordCounter = frequentWords.get(theText.get(i));
                wordCounter++;
                frequentWords.put(theText.get(i), wordCounter);
            } else {
                // new word
                frequentWords.put(theText.get(i), 1);
            }
        }

        for (int i = 0; i < theText.size(); i++) {
            if (frequentWords.containsKey(theText.get(i))) {
                // what to write here?
                frequentWords.get(theText.get(i));
            }
        }
        return topWordsArray;
    }

Another approach is to step back and ask: is a Map really the right conceptual object here? This looks like a good use case for a data structure that is much neglected in Java, the bag. A bag is like a set, but allows an item to appear multiple times. This greatly simplifies the 'adding a found word' step.
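To make the bag idea concrete, here is a minimal sketch in plain Java (8 or later). The class name WordBag and its methods are hypothetical, made up for illustration and not part of any library; the point is only that adding a word looks the same whether or not it has been seen before.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a library class: a tiny bag (multiset) of words
// backed by a HashMap from word to occurrence count.
class WordBag {
    private final Map<String, Integer> counts = new HashMap<>();

    // Same call for new and already-seen words: merge() stores 1 for a new
    // key and otherwise adds 1 to the existing count.
    void add(String word) {
        counts.merge(word, 1, Integer::sum);
    }

    // Number of times the word was added; 0 if it was never added.
    int count(String word) {
        return counts.getOrDefault(word, 0);
    }

    // Builds a bag from a list of words, one add() per word.
    static WordBag of(List<String> words) {
        WordBag bag = new WordBag();
        for (String word : words) {
            bag.add(word);
        }
        return bag;
    }
}
```

With this, WordBag.of(theText) would replace the whole containsKey/else loop from the question.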

Google's guava-libraries provides a Bag structure, though there it's called a Multiset. Using a Multiset, you could just call .add() once for each word, even if it's already in there. Even easier, though, you could throw your loop away:

Multiset<String> words = HashMultiset.create(theText);

Now that you have a Multiset, what do you do? Well, you can call entrySet(), which gives you a Set of Multiset.Entry objects. You can then copy them into a List and sort them using a Comparator. Full code might look like this (using a few other fancy Guava features to show them off):

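// Imports needed at the top of the file:
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Iterables;
import com.google.common.collect.Lists;
import com.google.common.collect.Multiset;

// Inside topWords(...) from the question, replacing both loops:
Multiset<String> words = HashMultiset.create(theText);

List<Multiset.Entry<String>> wordCounts = Lists.newArrayList(words.entrySet());
Collections.sort(wordCounts, new Comparator<Multiset.Entry<String>>() {
    @Override
    public int compare(Multiset.Entry<String> left, Multiset.Entry<String> right) {
        // Note reversal of 'right' and 'left' to get descending order.
        // getCount() returns a primitive int, so use Integer.compare.
        return Integer.compare(right.getCount(), left.getCount());
    }
});
// wordCounts now contains all the words, sorted by count descending

// Take the first numberOfWordsToFind entries (alternative: use a loop; this is
// simple because it copes easily with fewer elements than requested)
Iterable<Multiset.Entry<String>> topEntries = Iterables.limit(wordCounts, numberOfWordsToFind);

// Guava-ey alternative: use a Function and Iterables.transform, but in this case
// the 'manual' way is probably simpler:
for (Multiset.Entry<String> entry : topEntries) {
    topWordsArray.add(entry.getElement());
}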

and you're done!

You can sort the HashMap's entries by their values. After sorting, you can just iterate over the first 500 entries.
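A minimal sketch of that idea using only the standard library might look like the following. The class name TopWordsByValue and the alphabetical tie-break are assumptions added for illustration; the answer itself only says to sort by value and take the first entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper class for illustration.
class TopWordsByValue {

    // Returns up to n words from 'counts', most frequent first.
    // Assumption: words with equal counts are ordered alphabetically so
    // that the result does not depend on HashMap iteration order.
    static List<String> topWords(Map<String, Integer> counts, int n) {
        // A HashMap has no order of its own, so copy the entries into a list.
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());

        entries.sort((a, b) -> {
            // Descending by count: note 'b' before 'a'.
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        // Math.min copes with maps that hold fewer than n distinct words.
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(entry.getKey());
        }
        return result;
    }
}
```

In the question's method, the second loop could then be replaced by a call such as topWordsArray.addAll(TopWordsByValue.topWords(frequentWords, numberOfWordsToFind)).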

Take a look at the TreeBidiMap provided by the Apache Commons Collections package. http://commons.apache.org/collections/api-release/org/apache/commons/collections/bidimap/TreeBidiMap.html

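It keeps the map sorted by both its keys and its values. Note that a bidirectional map requires values to be unique as well as keys, so two words with the same count cannot both be stored in a word-to-count TreeBidiMap.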

Hope it helps.

Zhongxian