开发者

Fast and efficient algorithm for Phrases Dictionary lookup?

开发者 https://www.devze.com 2023-04-04 08:04 出处:网络
Let\'s say I have a dictionary with a few million words and phrases. For each input sentence I want to identify (exact match) all words/phrases that the dictionary contains. The longest dictionary nam

Let's say I have a dictionary with a few million words and phrases. For each input sentence I want to identify (exact match) all words/phrases that the dictionary contains. The longest dictionary name should be preferred and no overlaps. For example:

Sentence: "Los Angeles Lakers visited Washington State last week"
Dictionary: {Los Angeles, Lakers, Los Angeles Lakers, Washington, State, Washington State University}

Then the sentence would be tagged as follows:
[Los Angeles Lakers] visited [Washington] [State] last week. 

One solution I can think of is storing the dictionary in memory with constant look-up time (e.g., hash based set) and then extracting all word n-grams from each sentence (n can be set to the number of words in the longest phrase in the dictionary) comparing each against the dictionary and keeping the longest ones that don't overlap. Is there a better solution? (because the n-gram generation can 开发者_高级运维be slow). Maybe trees can help?

Thanks!


You may want to consider something like a radix tree or a prefix tree, using whole words as part of your buildup. These are the kinds of trees that are natural to dictionary type problems.

Then simply split things into words, and perform a search of the trie. Depending on the expected length of the groupings, you would build from the front (reluctant) or from the back (greedy).


You might look at DAWG (Directed acyclic word graph). You would store the whole phrases as paths in DAWG. Then, you would start to match the sentence and find the longest phrase that matches as the longest path. You would then proceed with the rest unmatched sentence similarly. White space would need some special handling.


The following code performs non-overlapping tagging of a sentence giving priority to longer matches.

It treats a sentence as an array of string tokens.

It uses a "max_key_size" parameter (maximum size of a key in your dictionary) to avoid searching matches that will never happen. In your example, the resulting sentence is:

[[Los Angeles Lakers], visited, [Washington], [State], last, week]

Hope it helps:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;

public class NonOverlappingTagging {

    private static ArrayList non_overlapping_tagging(String[] sentence, HashMap dict,int max_key_size) {
        ArrayList tag_sentence = new ArrayList();
        int N = sentence.length;

        if (max_key_size == -1) {
            max_key_size = N;
        }

        int i = 0;

        while (i < N) {

            boolean tagged = false;
            int j = Math.min(i + max_key_size, N); //avoid overflow

            while (j > i) {
                String[] literal_tokens = Arrays.copyOfRange(sentence, i, j);
                String literal = join(literal_tokens, " ");
                System.out.println(literal);

                if (dict.get(literal) != null) {
                    tag_sentence.add("["+literal+"]");
                    i = j;
                    tagged = true;
                }
                else {
                    j -= 1;
                }

            }

            if (!tagged) {
                tag_sentence.add(sentence[i]);
                i += 1;
            }
        }

        return tag_sentence;
    }

    private static String join(String[] sentence, String separator) {
        String result = "";
        for (int i = 0; i < sentence.length; i++) {
            String word = sentence[i];
            result += word + separator;
        }

        return result.trim();
    }

    public static void main(String[] args) {
        String[] sentence = {"Los", "Angeles", "Lakers", "visited", "Washington", "State", "last", "week"};
        HashMap <String, Integer>dict = new HashMap();
        dict.put("Los Angeles", 1);
        dict.put("Lakers", 1);
        dict.put("Los Angeles Lakers", 1);
        dict.put("Washington", 1);
        dict.put("State", 1);
        dict.put("Washington State University", 1);

        ArrayList tagging = non_overlapping_tagging(sentence, dict, -1);

        System.out.println(tagging);    
    }   
}
0

精彩评论

暂无评论...
验证码 换一张
取 消