I'm trying to parse an HTTP GET request to determine whether the URL contains any of a number of file types. If it does, I want to capture the entire request. There is something I don't understand about ORing.
The following regular expression only captures part of it, and only if .flv is the first in the list of ORed values.
(I've obscured the URLs with spaces because Stack Overflow limits hyperlinks.)
regex:
GET.*?(\.flv)|(\.mp4)|(\.avi).*?
test text:
GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy
match output:
GET http: // foo.server.com/download/0/37/3000016511/.flv
I don't understand why the .*? at the end of the regex isn't allowing it to capture the entire text. If I get rid of the ORing of file types, then it works.
Here is the test code in case my explanation doesn't make sense:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The enclosing class was cut from the original snippet; the name RegexTest is a placeholder.
public class RegexTest {
    public static void main(String[] args) {
        String sourcestring = "GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy";

        Pattern re = Pattern.compile("GET .*?\\.flv.*"); // this works
        // output:
        // [0][0] = GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy

        // The match from the following ends with ".flv", not the entire URL.
        // It also only works if .flv is the first of the 3 ORed options.
        //Pattern re = Pattern.compile("GET .*?(\\.flv)|(\\.mp4)|(\\.avi).*?");
        // output:
        // [0][0] = GET http: // foo.server.com/download/0/37/3000016511/.flv
        // [0][1] = .flv
        // [0][2] = null
        // [0][3] = null

        Matcher m = re.matcher(sourcestring);
        int mIdx = 0;
        while (m.find()) {
            // Group 0 is the whole match; groups 1..groupCount() are the captures.
            for (int groupIdx = 0; groupIdx <= m.groupCount(); groupIdx++) {
                System.out.println("[" + mIdx + "][" + groupIdx + "] = " + m.group(groupIdx));
            }
            mIdx++;
        }
    }
}
You have your grouping wrong. The | needs to be inside the parentheses:
GET.*?(\.flv|\.mp4|\.avi).*?
I'm also not sure why you have the ? on the end of the final .*?. In most languages, the ? here makes the * non-greedy, so it matches as few characters as possible, while not preventing the pattern from matching. In this case that would mean it matches no characters, since nothing follows it, so you probably want to remove that final ?.
GET .*?(\.flv|\.mp4|\.avi).*
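As a rough check of the corrected pattern, here is a minimal Java sketch. The class name GroupedAlternationDemo and the helper captureRequest are made up for illustration, and the second test URL (index.html) is an assumed example, not something from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question's code.
public class GroupedAlternationDemo {
    // The alternation lives inside one group, and the trailing .* is greedy,
    // so a successful match extends to the end of the line.
    static final Pattern MEDIA_REQUEST =
            Pattern.compile("GET .*?(\\.flv|\\.mp4|\\.avi).*");

    // Returns the whole matched request, or null if no listed extension occurs.
    static String captureRequest(String line) {
        Matcher m = MEDIA_REQUEST.matcher(line);
        return m.find() ? m.group(0) : null;
    }

    public static void main(String[] args) {
        String request =
                "GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy";
        // Prints the full request line, including the query string.
        System.out.println(captureRequest(request));
        // Prints null: none of the three extensions appears (assumed example URL).
        System.out.println(captureRequest("GET http: // foo.server.com/index.html"));
    }
}
```

Group 1 holds whichever extension matched, so m.group(1) can be used if you also need to know the file type.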
First of all, your regex reads like this:
GET.*?(\.flv) | (\.mp4) | (\.avi).*?
(spaces added for clarity). Try it like this:
GET.*?(\.flv|\.mp4|\.avi).*?
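To see why the original pattern behaves as three separate alternatives, here is a small Java sketch that runs it unchanged against two inputs. The class name UngroupedAlternationDemo, the helper firstMatch, and the .mp4 URL are assumptions for illustration only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is the one from the question, unchanged.
public class UngroupedAlternationDemo {
    // | has the lowest precedence, so this is really three patterns:
    //   GET.*?(\.flv)   or   (\.mp4)   or   (\.avi).*?
    static final Pattern UNGROUPED =
            Pattern.compile("GET.*?(\\.flv)|(\\.mp4)|(\\.avi).*?");

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String line) {
        Matcher m = UNGROUPED.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // First alternative applies: the match runs from GET up to and including .flv.
        System.out.println(firstMatch(
                "GET http: // foo.server.com/download/0/37/3000016511/.flv?mt=video/xy"));
        // Only the second alternative can match here, so the result is just ".mp4"
        // (assumed example URL).
        System.out.println(firstMatch(
                "GET http: // foo.server.com/clip.mp4?mt=video/xy"));
    }
}
```

The "GET" prefix belongs only to the first alternative, which is why the request is picked up only when .flv comes first in the list.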