What is a regular expression for detecting a for loop, and another one for detecting a while loop?
I want to detect for(--;--;--)
and while (--comparison operator --)
constructs.
You can't do this reliably with a regex. You need to parse the code with a proper parser.
You folks who are using \s in Java to detect whitespace in Java code are making at least one and maybe several mistakes.
First of all, the Java compiler’s idea of whitespace in its own source code doesn’t line up with what \s matches in Java. You can get at Java’s Character.isWhitespace() through the \p{javaWhitespace} property.
Secondly, Java does not allow \s to match Unicode whitespace; as implemented in the Java Pattern class, \s matches only ASCII whitespace by default. In fact, before Java 7 added the Pattern.UNICODE_CHARACTER_CLASS flag, Java did not support any property that corresponds to Unicode whitespace.
Here’s a table showing some of the problem areas:
                     000A    0085    00A0    2029
                     J  P    J  P    J  P    J  P
\s                   1  1    0  1    0  1    0  1
\pZ                  0  0    0  0    1  1    1  1
\p{Zs}               0  0    0  0    1  1    0  0
\p{Space}            1  1    0  1    0  1    0  1
\p{Blank}            0  0    0  0    0  1    0  0
\p{Whitespace}       -  1    -  1    -  1    -  1
\p{javaWhitespace}   1  -    0  -    0  -    1  -
\p{javaSpaceChar}    0  -    0  -    1  -    1  -
What you’re looking at on the x-axis is four different code points:
U+000A: LINE FEED (LF)
U+0085: NEXT LINE (NEL)
U+00A0: NO-BREAK SPACE
U+2029: PARAGRAPH SEPARATOR
The y-axis has eight different regex tests, mostly properties. For each of those code points, there is both a J-results column for Java and a P-results column for Perl or any other PCRE-based regex engine.
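The J columns for \s and the two Java-only properties can be checked with a few lines of Java. This is a minimal sketch and not part of the original answer; the class name WhitespaceProbe and the helper matches() are made up for illustration, and it assumes a default Pattern with no flags such as UNICODE_CHARACTER_CLASS set.

```java
import java.util.regex.Pattern;

public class WhitespaceProbe {
    // Hypothetical helper: true when the single code point cp matches regex.
    static boolean matches(String regex, int cp) {
        return Pattern.matches(regex, new String(Character.toChars(cp)));
    }

    public static void main(String[] args) {
        int[] codePoints = {0x000A, 0x0085, 0x00A0, 0x2029};
        String[] tests = {"\\s", "\\p{javaWhitespace}", "\\p{javaSpaceChar}"};
        for (String test : tests) {
            // One row per regex test, one 0/1 cell per code point.
            StringBuilder row = new StringBuilder(String.format("%-20s", test));
            for (int cp : codePoints) {
                row.append(matches(test, cp) ? " 1" : " 0");
            }
            System.out.println(row);
        }
    }
}
```

Each printed row should agree with the corresponding J column in the table: \s accepts only U+000A, \p{javaWhitespace} accepts U+000A and U+2029, and \p{javaSpaceChar} accepts U+00A0 and U+2029.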
It’s a big problem. Java is just messed up, giving answers that are "wrong" according to existing practice and also according to Unicode. Plus Java doesn’t even give you access to the real Unicode properties. For the record, these are the code points with the Unicode whitespace property:
% unichars '\p{Whitespace}'
0009 CHARACTER TABULATION
000A LINE FEED (LF)
000B LINE TABULATION
000C FORM FEED (FF)
000D CARRIAGE RETURN (CR)
0020 SPACE
0085 NEXT LINE (NEL)
00A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
180E MONGOLIAN VOWEL SEPARATOR
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2007 FIGURE SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE
If you want, feel free to grab the unichars program and play around with it and its companion programs, uniprops and uninames. I haven’t added the Java-only properties yet, but I intend to. There are just too many nasty surprises like those described above.
For kicks and grins, would you believe there’s a \p{javaJavaIdentifierStart} property in Java? I kid you not. But you wouldn’t believe the characters the compiler actually lets you use in identifiers; really you wouldn’t. Somebody wasn’t paying attention. Again. :(
You can parse almost anything with modern (PCRE-style) regex. However, parsing certain things correctly is often pathologically difficult. It's easy to build a small, terse regex to match only certain kinds of simply formatted for loops:
for\s*\([^;]*?;[^;]*?;[^)]*?\)
But what happens when you run into something like this?
int i = 0;
for(
String s = "for(0;1;2)";
s.indexOf(String.valueOf(i)) != -1;
i++ // increment the i variable ;-)
)
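To see concretely what goes wrong, the terse regex can be run over that snippet. The following is a small sketch; the class name NaiveForRegex, the constant TRICKY, and the helper firstMatch() are illustrative names, not anything from a library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NaiveForRegex {
    // The terse pattern for simply formatted for loops.
    static final Pattern FOR_LOOP =
        Pattern.compile("for\\s*\\([^;]*?;[^;]*?;[^)]*?\\)");

    // The awkward loop, as a Java string.
    static final String TRICKY =
        "int i = 0;\n"
      + "for(\n"
      + "    String s = \"for(0;1;2)\";\n"
      + "    s.indexOf(String.valueOf(i)) != -1;\n"
      + "    i++ // increment the i variable ;-)\n"
      + ")\n";

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String source) {
        Matcher m = FOR_LOOP.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(TRICKY));
    }
}
```

The match starts at the real for( but ends at the closing parenthesis inside the string literal "for(0;1;2)", because the regex has no notion of string literals or comments. The loop is "detected", but the matched text is not the loop header.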
Better to use a full-blown purpose-built Java parser if you need 100% reliability. The java.net article Source Code Analysis Using Java 6 APIs gives a jumping-off point for one way to do reliable parsing of Java source code.
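As a rough illustration of the parser route, here is a sketch that counts for and while loops with the JDK's Compiler Tree API (javax.tools plus com.sun.source). It is not taken from the article; the class LoopCounter, the nested StringSource, and the method countLoops() are hypothetical names, and it assumes it runs on a full JDK, where ToolProvider.getSystemJavaCompiler() returns a compiler.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.ForLoopTree;
import com.sun.source.tree.WhileLoopTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class LoopCounter {
    // In-memory source file, so the example needs nothing on disk.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String className, String code) {
            super(URI.create("string:///" + className + ".java"), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Returns {forLoops, whileLoops} as seen by the compiler's own parser.
    static int[] countLoops(String className, String code) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) { // happens on a runtime without the compiler
            throw new IllegalStateException("no system Java compiler available");
        }
        JavacTask task = (JavacTask) compiler.getTask(
            null, null, null, null, null,
            Collections.singletonList(new StringSource(className, code)));
        final int[] counts = new int[2];
        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override
            public Void visitForLoop(ForLoopTree node, Void p) {
                counts[0]++;
                return super.visitForLoop(node, p); // keep walking nested loops
            }

            @Override
            public Void visitWhileLoop(WhileLoopTree node, Void p) {
                counts[1]++;
                return super.visitWhileLoop(node, p);
            }
        };
        try {
            for (CompilationUnitTree unit : task.parse()) {
                scanner.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

Because the source is parsed rather than pattern-matched, a string literal such as "for(0;1;2)" or a comment containing a semicolon is never mistaken for a loop. Note that visitForLoop covers only the classic three-part for; the for-each form and do-while have their own visitor methods.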
In reply to Taz's comment:
I did it with
.*for(.*;.*;.*).*
what could be wrong with this?
Assuming all the for-loops you want to match have:
- no linebreaks in them,
- no embedded/trailing comments
- no "string" or 'c'haracter literals in them
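I think your pattern should be OK, provided the parentheses are escaped so that they match literally instead of forming a group. You might also want to allow for whitespace between the for and the opening parenthesis:
.*for\s*\(.*;.*;.*\).*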
However, as tchrist points out in his answer to this question, \s* is not a perfectly correct way to allow for whitespace in Java source code, as Java source code supports types of Unicode whitespace that \s does not allow for. Again, if you need 100% reliability, a full Java source code parser is probably a better choice.
Make sure you turn off (or don't turn on) the "dot matches newline" option in your regex engine (e.g. DOTALL or Singleline). Otherwise .* can run across line breaks, and the regex is likely to match far more than a single loop header.
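The effect of that option is easy to see in Java, where it is the Pattern.DOTALL flag. A minimal sketch, using the pattern with the parentheses escaped; the class name DotAllDemo and the helper firstMatch() are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DotAllDemo {
    static final String REGEX = ".*for\\s*\\(.*;.*;.*\\).*";

    // Returns the first match of REGEX in source under the given flags, or null.
    static String firstMatch(String source, int flags) {
        Matcher m = Pattern.compile(REGEX, flags).matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String source = "int x = 1;\n"
                      + "for (i = 0; i < n; i++) x++;\n"
                      + "int y = 2;\n";
        // Without DOTALL the match is confined to the line holding the loop.
        System.out.println("[" + firstMatch(source, 0) + "]");
        // With DOTALL the leading and trailing .* swallow the other lines too.
        System.out.println("[" + firstMatch(source, Pattern.DOTALL) + "]");
    }
}
```

With no flags the match is just the second line; with Pattern.DOTALL it is the entire input, which is rarely what you want when looking for one loop.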
for ?\(.*?;.*?;.*?\)
while ?\(.+?\)
If the code is going to be anything seriously complicated (anything beyond asking whether such a loop occurs somewhere in the code), use a parser instead.
Why do we need these ? here? And I do need to detect that there is a comparison operator in the while loop.
If I were to leave the ? out then it would match for ( for(this;that;theother)
I updated the while loop to use +
I think the regular expressions given by JV contain extra question marks.
Here is my version:
for\s*\([^;]*;[^;]*;[^)]*\)
while\s*\(.*?\)
is correct but
while\s*\([^)]*\)
should be faster.
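The two while patterns are not quite interchangeable, which a short Java sketch can show. The class name WhileHeaderRegex and the helper firstMatch() are illustrative only, and the example says nothing about relative speed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhileHeaderRegex {
    static final Pattern LAZY = Pattern.compile("while\\s*\\(.*?\\)");
    static final Pattern NEGATED = Pattern.compile("while\\s*\\([^)]*\\)");

    // Returns the text of the first match, or null if there is none.
    static String firstMatch(Pattern p, String source) {
        Matcher m = p.matcher(source);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String[] samples = {
            "while (i < 10) {",
            "while (it.hasNext()) {",   // nested parentheses
            "while (a\n    && b) {"     // condition split across lines
        };
        for (String s : samples) {
            System.out.println(firstMatch(LAZY, s) + " | " + firstMatch(NEGATED, s));
        }
    }
}
```

Both patterns stop at the first closing parenthesis, so a condition with a nested call such as it.hasNext() is cut short. They differ on a condition that spans lines: the dot does not match a newline by default, so the lazy version finds nothing, while [^)] happily crosses the line break.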
For loops are the easiest to detect:
for *\(.*;.*;.*\)
While loops are a little trickier, as there are two ways to do it. If you want to use the format you specify above, this should work:
while *\(.*(<|>|<=|>=|==|!=).*\)
However, this does not detect while conditions that depend on the boolean value of a variable, nor the boolean result from a method, so this version would be a little simpler and match more:
while *\(.*\)
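Here is how the two while patterns behave on a few sample lines in Java. This is only a sketch; the class name WhileRegex, the constant names, and the helper found() are made up for the example.

```java
import java.util.regex.Pattern;

public class WhileRegex {
    // Requires a comparison operator somewhere inside the parentheses.
    static final Pattern WITH_COMPARISON =
        Pattern.compile("while *\\(.*(<|>|<=|>=|==|!=).*\\)");
    // Accepts any condition at all.
    static final Pattern ANY_CONDITION = Pattern.compile("while *\\(.*\\)");

    // True when the pattern occurs anywhere in the line.
    static boolean found(Pattern p, String line) {
        return p.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "while (i < 10) {",
            "while (running) {",
            "while (s.equals(\">\")) {"   // operator character inside a string literal
        };
        for (String line : lines) {
            System.out.println(found(WITH_COMPARISON, line) + " "
                + found(ANY_CONDITION, line) + "  " + line);
        }
    }
}
```

The stricter pattern accepts the first line and rejects the second, as intended, but it also accepts the third, where the > sits inside a string literal. That is the kind of false positive a regex cannot rule out without understanding the code.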
Regular expressions can only parse regular (Chomsky Type 3) languages. Java is not a regular language; it is at least context-free (Type 2), maybe even context-sensitive (Type 1).
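Nested parentheses are the classic case: finding the ) that closes a given ( needs a depth counter, which is beyond a regular language. A minimal Java sketch follows, with the hypothetical names ParenMatcher and matchingParen(), under the loudly stated assumption that no parentheses hide inside string literals, character literals, or comments.

```java
public class ParenMatcher {
    // Returns the index of the ')' that closes the '(' at openIndex, or -1.
    // Assumption: parentheses never appear inside literals or comments.
    static int matchingParen(String source, int openIndex) {
        if (openIndex < 0 || openIndex >= source.length()
                || source.charAt(openIndex) != '(') {
            return -1;
        }
        int depth = 0;
        for (int i = openIndex; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')' && --depth == 0) {
                return i; // back at depth zero: this closes the opening paren
            }
        }
        return -1; // ran out of input before the parenthesis was closed
    }

    public static void main(String[] args) {
        String line = "while (it.hasNext()) {";
        int open = line.indexOf('(');
        System.out.println(line.substring(open, matchingParen(line, open) + 1));
    }
}
```

Even this only handles nesting. Skipping literals and comments correctly takes a real tokenizer, which is where a proper parser comes in.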