I'm writing a program in .net where the user may provide a large number of regular expressions. For a given string, I need to figure out which regular expression matches that string (if more than one matches, I just need the first one that matches). However, if there are a large number of regular expressions this operation 开发者_开发技巧can take a very long time.
I was somewhat hoping there would be something similar to flex (The Fast Lexical Analyzer (not Adobe Flex)) for .net that would allow me to specify a large number of regular expressions yet quickly (O(n) according to Wikipedia for n = len(input string)) figure out which regular expression matches.
Also, I would prefer not to implement my own regular expression engine :).
Find the biggest chunk of constant text in each regex (if above a certain threshold length) and use the Karp-Rabin algorithm to search for any of those strings simultaneously. For each match, run that regex to see if the whole thing matches. For each regex not included in the multi string search, search that regex directly.
This should give you good performance for a large number of regular expressions if they have reasonable-length constant substrings, assuming you have preprocessing time available for the regular expressions.
What? Even testing if a single regular expression matches cannot be done in O(n) time in general. Where did you get this information from? What feature is this in Flex? I am sure it must be some limited form of regular expressions, not for arbitrary .NET regular expressions.
To handle arbitrary regular expressions the simple way is to keep your regular expressions stored in a List
and simply iterate over each regular expression one by one until you find one that matches.
A quick web search shows that there exists a lex like tool named C#Lex. But since I do not use .NET or C#, I cannot say whether it is good, and if it is useful for you.
For Java, I found JLex and JFlex, both of which generate source code. Using these seems only reasonable if the regular expressions are literally compiled "off-line" (outside the application), and then incorporated into your application classpath. The .NET version probably behaves similarly.
精彩评论