I am writing a program that has to strip a lot of rubbish out of fairly long strings. I do that with regular expressions, and because the program is rather speed-sensitive, I need to know which approach is faster: several consecutive, relatively simple regular expressions, or a single but quite complex one?
Best regards, Timofey.
You need to benchmark this stuff to be sure, and be sure to blog your results. I suspect one big regex will be quicker than many small ones, but I'm curious to see what you find out.
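A rough way to run that benchmark is sketched below. It is only a minimal harness, not a rigorous one: the class name `RegexTiming`, the helper `timeNanos`, and the two sample patterns are hypothetical stand-ins rather than the asker's real expressions, and a dedicated tool such as JMH would likely give more trustworthy numbers.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Pattern;

class RegexTiming {
    // Applies the cleaner to every input `rounds` times and returns the elapsed nanoseconds.
    static long timeNanos(UnaryOperator<String> cleaner, String[] inputs, int rounds) {
        long checksum = 0; // consume the results so the work is less likely to be optimized away
        long start = System.nanoTime();
        for (int r = 0; r < rounds; r++) {
            for (String in : inputs) {
                checksum += cleaner.apply(in).length();
            }
        }
        long elapsed = System.nanoTime() - start;
        if (checksum < 0) {
            throw new IllegalStateException("unreachable: lengths are never negative");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in patterns; substitute the real expressions here.
        Pattern tags = Pattern.compile("</?\\w+>");
        Pattern attrs = Pattern.compile("align=\"\\w+\"");
        Pattern both = Pattern.compile("</?\\w+>|align=\"\\w+\"");
        String[] inputs = { "<b>bold</b> align=\"left\" text", "<i>x</i>" };

        UnaryOperator<String> sequential =
            s -> attrs.matcher(tags.matcher(s).replaceAll("")).replaceAll("");
        UnaryOperator<String> combined = s -> both.matcher(s).replaceAll("");

        // Warm-up runs so the JIT has a chance to compile both paths before measuring.
        timeNanos(sequential, inputs, 10_000);
        timeNanos(combined, inputs, 10_000);

        System.out.println("sequential ns: " + timeNanos(sequential, inputs, 10_000));
        System.out.println("combined   ns: " + timeNanos(combined, inputs, 10_000));
    }
}
```

Run it against your real strings rather than toy inputs, since the relative cost depends heavily on how often each alternative matches.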
The java.util.regex.Pattern class is pretty complex and I don't pretend to know what optimizations it performs. I do know regexes compile into a graph, so an obvious one would be to combine overlapping paths. The more variations you stuff into a single expression, the more such opportunities arise. It may also reduce the number of passes over the input data.
As many of you suggested, I tried it, and here are the results:
After joining some of the regexps that I had been applying consecutively into one, my execution time almost doubled (from 10 seconds to 18 seconds for the same 1000 strings).
So, basically, it turns out that removing as much as you can sequentially, so that each subsequent regexp has a shorter string left to clean, is faster than using one long regexp.
PS. Unfortunately, I couldn't post the regexps themselves at first, as they get corrupted by the code highlighter.
PPS. Here are some of the regexps I used sequentially:
s = s.replaceAll("<span STYLE=\"color:[\\w|\\d|\\(|\\)|\\,]++\">", "");
s = s.replaceAll("</{0,1}\\w++>", "");
s = s.replaceAll("<img SRC=\"/gif/", "");
s = s.replaceAll("(width|height)\\s{0,}=\\s{0,}\"{0,1}\\d{1,}\"{0,1}", "");
s = s.replaceAll("align=\"\\w++\"", "");
Then I combined them by wrapping each one in parentheses and placing | between them.
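For anyone who wants to reproduce the comparison, here is a minimal sketch of the two variants side by side. It assumes the five expressions listed above with the backslashes doubled as Java string literals require; the class name `CombinedCleanup` and its method names are made up for illustration. Both variants precompile their patterns, because `String.replaceAll` compiles the expression again on every call.

```java
import java.util.regex.Pattern;

class CombinedCleanup {
    // The five expressions from the post, escaped for Java string literals.
    static final String[] PARTS = {
        "<span STYLE=\"color:[\\w|\\d|\\(|\\)|\\,]++\">",
        "</{0,1}\\w++>",
        "<img SRC=\"/gif/",
        "(width|height)\\s{0,}=\\s{0,}\"{0,1}\\d{1,}\"{0,1}",
        "align=\"\\w++\""
    };

    static final Pattern[] SEQUENTIAL = new Pattern[PARTS.length];
    static final Pattern COMBINED;

    static {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < PARTS.length; i++) {
            SEQUENTIAL[i] = Pattern.compile(PARTS[i]);
            if (i > 0) {
                sb.append('|');
            }
            // Wrap each expression in parentheses and join them with |, as described above.
            sb.append('(').append(PARTS[i]).append(')');
        }
        COMBINED = Pattern.compile(sb.toString());
    }

    // One pass over the string per pattern.
    static String cleanSequential(String s) {
        for (Pattern p : SEQUENTIAL) {
            s = p.matcher(s).replaceAll("");
        }
        return s;
    }

    // A single pass with the big alternation.
    static String cleanCombined(String s) {
        return COMBINED.matcher(s).replaceAll("");
    }
}
```

The two methods are not guaranteed to produce identical output for every input: a later sequential pass can match text that only becomes adjacent after an earlier pass has deleted something, whereas the combined pattern sees the original string once.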