I'm trying to extract snippets of dialogue from a book text. For example, if I have the string
"What's the matter with the flag?" inquired Captain MacWhirr. "Seems all right to me."
then I want to extract "What's the matter with the flag?" and "Seems all right to me."
I found a regular expression to use here, which is "[^"\\]*(\\.[^"\\]*)*". This works great in Eclipse when I do a Ctrl+F regex find on my book .txt file, but when I run the following code:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
if (m.find())
    System.out.println(m.group(1));
The only thing that prints is null. So am I not converting the regex into a Java string properly? Do I need to take into account the fact that Java strings use \" for the double quotes?
In natural language text, it's unlikely that a " is escaped by a preceding backslash, so you should be able to use just the pattern "([^"]*)". As a Java string literal, this is "\"([^\"]*)\"".
Here it is in Java:
String regex = "\"([^\"]*)\"";
String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
    System.out.println(m.group(1));
}
The above prints:
What's the matter with the flag?
Seems all right to me.
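If you want the snippets as a list rather than printed output, the same loop can be wrapped in a small helper. This is only a sketch: the class name DialogueExtractor and the method name extractQuotes are made up for illustration, and it assumes the text uses straight ASCII double quotes (a book file with typographic quotes would need a different pattern).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is illustrative only.
class DialogueExtractor {
    // Regex "([^"]*)": a quote, a run of non-quote characters (group 1), a quote.
    private static final Pattern QUOTED = Pattern.compile("\"([^\"]*)\"");

    // Returns the text between each pair of straight double quotes, in order.
    static List<String> extractQuotes(String text) {
        List<String> quotes = new ArrayList<>();
        Matcher m = QUOTED.matcher(text);
        while (m.find()) {          // find() advances to the next match each call
            quotes.add(m.group(1)); // group 1 excludes the surrounding quotes
        }
        return quotes;
    }

    public static void main(String[] args) {
        String bookText = "\"What's the matter with the flag?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
        // prints [What's the matter with the flag?, Seems all right to me.]
        System.out.println(extractQuotes(bookText));
    }
}
```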
On escape sequences
Given this declaration:
String s = "\"";
System.out.println(s.length()); // prints "1"
The string s has only one character, ". The \" in the source is an escape sequence that exists only at the Java source code level; the string itself contains no backslash.
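A quick way to check a regex literal is to print the string and compare it with the pattern you typed into Eclipse. The sketch below does that for both patterns in this thread; the class and constant names are illustrative, not from the question.

```java
// Hypothetical demo class; prints the regex text the engine actually receives.
class EscapeDemo {
    // Java literal for the simple pattern "([^"]*)"
    static final String SIMPLE = "\"([^\"]*)\"";
    // Java literal for the question's pattern "[^"\\]*(\\.[^"\\]*)*"
    static final String ORIGINAL = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"";

    public static void main(String[] args) {
        System.out.println(SIMPLE);            // prints "([^"]*)"
        System.out.println(SIMPLE.length());   // prints 9
        System.out.println(ORIGINAL);          // prints "[^"\\]*(\\.[^"\\]*)*"
        System.out.println(ORIGINAL.length()); // prints 22
    }
}
```

The printed text for ORIGINAL is exactly the pattern that works in Eclipse, so the string literal in the question was converted correctly.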
See also
- JLS 3.10.6 Escape Sequences for Character and String Literals
The problem with the original code
There's actually nothing wrong with the pattern per se, but you're not capturing the right portion. Group 1 (\1) wraps only the optional escaped-character part, which matches zero times in this text, so m.group(1) returns null. Here's the pattern with the correct capturing group:
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"";
String bookText = "\"What's the matter?\" inquired Captain MacWhirr. \"Seems all right to me.\"";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(bookText);
while (m.find()) {
    System.out.println(m.group(1));
}
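To see where the null comes from, it can help to print both group() (the whole match) and group(1) for the original pattern. The following is a sketch with made-up names (GroupDemo, firstMatch) and a made-up second sample string that contains escaped quotes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing what the original pattern's group 1 holds.
class GroupDemo {
    // The pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile("\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\"");

    // Returns { whole match, group 1 } for the first match, or null if none.
    static String[] firstMatch(String text) {
        Matcher m = ORIGINAL.matcher(text);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        // No backslashes: the starred group repeats zero times, so group 1 is null.
        String[] plain = firstMatch("\"Seems all right to me.\"");
        System.out.println(plain[0]); // prints "Seems all right to me."
        System.out.println(plain[1]); // prints null

        // With escaped quotes, group 1 holds only the last repetition.
        String[] escaped = firstMatch("\"He said \\\"hi\\\" twice.\"");
        System.out.println(escaped[0]); // prints "He said \"hi\" twice."
        System.out.println(escaped[1]); // prints \" twice.
    }
}
```

So the original code did find a match; it just asked for a group that did not take part in it. Calling m.group() instead would have printed the dialogue with its surrounding quotes.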
For visual comparison, here's the original pattern, as a Java string literal:
String regex = "\"[^\"\\\\]*(\\\\.[^\"\\\\]*)*\""
                            ^^^^^^^^^^^^^^^^^
                            why capture this part?
And here's the modified pattern:
String regex = "\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\""
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                  we want to capture this part!
As mentioned before, though: this complicated pattern isn't necessary for natural language text, which isn't likely to contain escaped quotes.
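For contrast, here is a sketch of a case where the longer pattern does matter: input in which quotes inside the dialogue are escaped with a backslash, as in source code or some data formats. The class name, method name, and sample text are assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical comparison of the two patterns on text with escaped quotes.
class PatternComparison {
    // Simple pattern: "([^"]*)"
    static final Pattern SIMPLE = Pattern.compile("\"([^\"]*)\"");
    // Escape-aware pattern: "([^"\\]*(?:\\.[^"\\]*)*)"
    static final Pattern ESCAPE_AWARE =
            Pattern.compile("\"([^\"\\\\]*(?:\\\\.[^\"\\\\]*)*)\"");

    // Collects group 1 of every match of the given pattern.
    static List<String> extract(Pattern p, String text) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // The actual characters are: "He said \"hi\" twice." done
        String text = "\"He said \\\"hi\\\" twice.\" done";

        // The simple pattern stops at the first escaped quote and splits the string.
        System.out.println(extract(SIMPLE, text));       // prints [He said \,  twice.]
        // The escape-aware pattern treats \" as part of the quoted text.
        System.out.println(extract(ESCAPE_AWARE, text)); // prints [He said \"hi\" twice.]
    }
}
```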
See also
- regular-expressions.info/Grouping and backreferences