I need to parse a search query with a "Google-like" syntax (but simpler, since I don't need parentheses, operator nesting and such). An example string might be:
TAG1: a,b,c TAG2: 123 TAG3: a,45,44,b
So, simply put, I need to recognize tokens which look like a TAG (e.g. "color", "name", "age") followed by : and by a single "word" or a list of comma-separated words. I tried with some regex, but if a user makes mistakes with the syntax (like typing an extra comma, or forgetting a value after a tag, as in color: shape:) the parsing fails. I don't really know if this is my fault (I'm far from being an expert with regex) or if going with a parser like ANTLR would be a better choice. Anyway, I'm open to any kind of suggestion (I'm coding in Java; I know the language has nothing to do with it, but maybe there are some tools that may help).
Thanks for your suggestions...
Given a string like "TAG1: a,b,c TAG2: 123 TAG3: a,45,44,b"
Pattern tokens = Pattern.compile( "([a-zA-Z0-9]+):\\s*(\\w+(?:,?\\w+)*)" );
Matcher m = tokens.matcher( myString );
while( m.find() ) {
    System.out.println( "tag:" + m.group(1) + " value:" + m.group(2) );
}
That catches all of your cases and makes sure there is a certain well-formedness. Let me know if there is something I'm missing from your question.
Edit 1: To cover your other case you could do something like:
Pattern tokens = Pattern.compile( "([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)(?=\\s+[a-zA-Z0-9]+:)|([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)" );
And then check for groups 3 and 4 also.
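As a rough sketch of what "checking groups 3 and 4" could look like: when the second alternative of the pattern matches, groups 1 and 2 are null, so the code can fall back to groups 3 and 4. The class name `AlternationGroups` and the helper `describe` are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationGroups {
    // The two-alternative pattern from above, split over two lines for readability.
    static final Pattern TOKENS = Pattern.compile(
        "([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)(?=\\s+[a-zA-Z0-9]+:)"
        + "|([a-zA-Z0-9]+):\\s*(\\w+(?:[ ,]+?\\w+)*)");

    // Hypothetical helper: returns the matches as "tag -> value" joined by "; ".
    static String describe(String input) {
        StringBuilder sb = new StringBuilder();
        Matcher m = TOKENS.matcher(input);
        while (m.find()) {
            // group(1) is null when the second alternative matched.
            boolean first = m.group(1) != null;
            String tag = first ? m.group(1) : m.group(3);
            String value = first ? m.group(2) : m.group(4);
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(tag).append(" -> ").append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: TAG1 -> 1, 4; TAG2 -> 123
        System.out.println(describe("TAG1: 1, 4 TAG2: 123"));
    }
}
```

Note that the captured value still contains the raw separators ("1, 4"), so it would need a further split to get the individual words.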
Still, this regex is getting overly ambitious... though I'm not convinced a full-up parser would make your life that much easier in this case.
An alternative is to break it down one level at a time (which is what a parser would do anyway):
Pattern main = Pattern.compile( "([a-zA-Z0-9]+):" );
Matcher m = main.matcher(myString);
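int lastStart = -1; // -1 = no tag seen yet (0 is a valid start index)
while( m.find() ) {
    if( lastStart >= 0 ) {
        processToken( myString.substring(lastStart, m.start()) );
    }
    lastStart = m.start();
}
// Process the final token, but only if at least one tag was found
if( lastStart >= 0 ) {
    processToken( myString.substring(lastStart) );
}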
Or something like that. It's similar to forcing an &-style separator, but it relies on the implicit separation that is already part of your token syntax.
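To make the idea concrete, here is a minimal sketch of the whole thing, including one possible `processToken`. The original snippet leaves that method undefined, so the version below (and the `TagQueryParser` class, the `parse` method and the map-of-lists return type) are assumptions for illustration only. It splits each chunk at the first colon, then splits the rest on any run of commas and whitespace, so extra commas, extra spaces and missing values don't break anything.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagQueryParser {
    // Same tag pattern as in the snippet above: letters/digits followed by a colon.
    private static final Pattern TAG = Pattern.compile("([a-zA-Z0-9]+):");

    // Cuts the query at the start of each tag and parses every chunk on its own.
    static Map<String, List<String>> parse(String query) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = TAG.matcher(query);
        int lastStart = -1; // -1 = no tag seen yet
        while (m.find()) {
            if (lastStart >= 0) {
                processToken(query.substring(lastStart, m.start()), result);
            }
            lastStart = m.start();
        }
        if (lastStart >= 0) {
            processToken(query.substring(lastStart), result);
        }
        return result;
    }

    // Hypothetical helper: turns a chunk like "TAG: a, b,,c" into TAG = [a, b, c].
    private static void processToken(String token, Map<String, List<String>> result) {
        int colon = token.indexOf(':');
        String tag = token.substring(0, colon).trim();
        List<String> values = new ArrayList<>();
        // Any run of commas and/or whitespace counts as a single separator.
        for (String part : token.substring(colon + 1).split("[,\\s]+")) {
            if (!part.isEmpty()) {
                values.add(part);
            }
        }
        // A tag without values ends up with an empty list instead of failing.
        result.computeIfAbsent(tag, k -> new ArrayList<>()).addAll(values);
    }

    public static void main(String[] args) {
        // prints: {TAG1=[a, b, c], TAG2=[], TAG3=[123]}
        System.out.println(parse("TAG1: a, b,,c TAG2: TAG3: 123,"));
    }
}
```

Whether an empty value list should be reported as an error or silently accepted is a design choice; the sketch just keeps the tag with an empty list so the caller can decide.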
You might want to check out the Lucene QueryParser; you might be able to use it for your needs. It uses a JavaCC-generated parser.
JavaCC
Lucene QueryParser
Thanks for your answers. PSpeed, the problem with your regex is that if a user puts an extra space in the comma-separated list (e.g. "TAG1: 1, 4") the match fails. Sorry, maybe I didn't explain it very well.
Anyway, since I can change the syntax, I decided a separator would make everything easier and came up with the following regex for it.
String testString = "TAG1: a,b,c & TAG2: dddd, dddd & TAG3: 123";
Pattern pattern = Pattern.compile("(?:\\s+|^)([A-Z]+:)\\s*(,*\\s*\\w+\\s*,*)+\\s*(?:$|&)");
But seeing as it fails on simple mistakes (what happens if the user forgets a &?), I'm starting to doubt whether regex is the right tool for this task...