开发者

Java - search collection of strings providing the first few characters

开发者 https://www.devze.com 2023-02-19 22:42 出处:网络
I have a collection of strings that I want to search providing only the first few characters. For example, consider the list of strings: [tom, tomaz, alice, tolstoy, john]. The string [to] would resu

I have a collection of strings that I want to search providing only the first few characters.

For example, consider the list of strings: [tom, tomaz, alice, tolstoy, john]. The string [to] would result in the list [tom, tomaz, tolstoy].

Performance is a major issue here and the list may be very large.

What is the best way to optimize th开发者_如何学Pythonis? Indexes? Sorting? How?

Thanks!


A trie is the universal solution, as has already been suggested but if you want a lightweight and relatively fast solution with no outside dependencies, simply put all your string into a TreeSet and use tailSet() to find the first element matching the prefix, then iterate through the tail set until you find a string that doesn't match. (Note: this could even be the first element if none of your strings match the prefix.)

If your list isn't bigger than a couple of thousand strings, this method is good enough in practice.


If you insist on using a list, your options are limited. It's simply not suited for this sort of thing.

The data structure that does exactly what you're trying to do is called a Trie (Wikipedia Entry)

A quick google brings up this java implementation from Duke University: http://www.cs.duke.edu/~ola/courses/cps108/fall96/joggle/trie/Trie.java


I recommend looking into tries to arrange your data if searching like this is a priority and it doesn't cause conflicts with your other requirements.


Look into Solr and Lucene. They do string searching by index or you could write your own as others have suggested.

http://lucene.apache.org/solr/


Assuming that your list is small enough to keep in memory, I would use a trie.

This will give you a lookup time proportional to the length of your prefix.


If you want to do this entirely in memory and without any dependencies, here's one quick option:

static int MAX_PREFIX = 3;
Map<String, List<String>> map = new HashMap<String, List<String>>();

public void addItem(String item) {
    for (int i = 0; i < MAX_PREFIX && i < item.length(); i++) {
        String prefix = item.substring(0, i);
        List<String> matches = map.get(prefix);
        if (matches == null) {
            matches = new ArrayList<String>();
            map.put(prefix, matches);
        }
        matches.add(item);
    }
}

public List<String> getMatches(String prefix) {
    List<String> matches = map.get(prefix);
    return matches == null ? Collections.<String>emptyList() : matches;
}

This is going to very fast since it's just a single Map lookup to go from your prefix String straight to a List<String> of your desired results. If your list is so large that it doesn't fit in memory then you'll need to consider going external. As mentioned you might want to look at Lucene for a local index. Or a database, simply index the column and do a LIKE 'prefix%' query.

0

精彩评论

暂无评论...
验证码 换一张
取 消