开发者

Efficient algorithm to find a set of matches in a short text

开发者 https://www.devze.com 2023-01-28 21:39 出处:网络
Input: A relatively short (100-1000 characters, usually) text. A fixed list of about 5000 expressions given in advance, most of them up 10-20 characters long, some of them containing others as subex

Input:

  1. A relatively short (100-1000 characters, usually) text.
  2. A fixed list of about 5000 expressions given in advance, most of them up 10-20 characters long, some of them containing others as subexpressions (e.g. "Try" and "Try Again").

Note - Only the first input changes, the second one is considered a constant, and is available for preprocessing.

Desired Output:

Identify all the matches of the expressions from item 2 inside the text. If there are match ambiguities, take the greedy match if possible开发者_开发问答.

Runtime should be relatively fast, though no strict performance requirements. A brute force attempt might suffice here.

What is a good algorithm for this? Are suffix trees useful here? How about going over all the expressions and putting them in a hashtable? Also note that I'm interested in practical solutions, so ease of implementation can be more useful than a super efficient algorithm...


Take a look at the Aho–Corasick algorithm.


The general "algorithm" assuming unlimited storage, for optimising this is to build up a tree on the data based on characters allowing you to search for your pattern recursively. In the tree index as you build, you traverse down until you reach a "unique" point and the "leaf" gives the location of where that unique occurrence is.

In the above paragraph for example the word "index" appears once. If the tree is built one character at a time then the tree path we follow would start with the "i" character then "in". If it is case sensitive there are just 3 occurrences (assuming, optimising and index). When we next search for 'd' we hit our unique result. Of course we could start our search first with the space, then the i and then the n and we'd follow a different path.

You could also make the tree case-insensitive, and you could use a "nybble" rather than a byte at each branch point.

0

精彩评论

暂无评论...
验证码 换一张
取 消