I am writing a parser using ply that needs to identify FORTRAN string literals. These are delimited by single quotes, and a single quote inside the string is escaped by doubling it, e.g.
'I don''t understand what you mean'
is a valid escaped FORTRAN string.
Ply takes its token definitions as regular expressions. My attempt so far does not work, and I don't understand why.
t_STRING_LITERAL = r"'[^('')]*'"
Any ideas?
A string literal is:
- An open single-quote, followed by:
- Any number of doubled-single-quotes and non-single-quotes, then
- A close single quote.
Thus, our regex is:
r"'(''|[^'])*'"
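As a quick check of the regex above, here is a minimal sketch using only Python's standard `re` module rather than ply. The names `FORTRAN_STRING`, `unescape`, and `source` are made up for illustration; in ply the same pattern would simply be assigned to `t_STRING_LITERAL`.

```python
import re

# Opening quote, then any run of '' pairs or non-quote characters, then a closing quote.
FORTRAN_STRING = re.compile(r"'(''|[^'])*'")


def unescape(literal):
    """Strip the outer quotes and collapse each doubled quote into one."""
    return literal[1:-1].replace("''", "'")


# Hypothetical input line containing two literals.
source = "x = 'I don''t understand what you mean' // 'ok'"

literals = [m.group(0) for m in FORTRAN_STRING.finditer(source)]
values = [unescape(s) for s in literals]

print(literals)
print(values)
```

Listing `''` before `[^']` is not strictly required here, since the two alternatives can never match at the same position, but it keeps the intent obvious.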
You want something like this:
r"'([^']|'')*'"
This says that between the enclosing single quotes you can have either a doubled single quote ('') or any character that is not a single quote.
The brackets define a character class, in which you list the characters that may (or, with a leading ^, may not) match. It doesn't allow anything more complicated than that, so trying to use parentheses to match a multiple-character sequence ('') doesn't work. Instead, your [^('')] character class is equivalent to [^'()], i.e. it matches any character that is not a single quote or a left or right parenthesis.
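A small sketch that illustrates the equivalence: the original pattern and one with the class written out as `[^'()]` should accept and reject the same inputs. The sample strings and variable names are invented for the demonstration.

```python
import re

broken = re.compile(r"'[^('')]*'")    # the pattern from the question
explicit = re.compile(r"'[^'()]*'")   # the same class with duplicates removed

samples = ["'plain text'", "'f(x)'", "'I don''t'"]

# For each sample: (does the question's pattern match all of it?, does the explicit one?)
results = {
    s: (bool(broken.fullmatch(s)), bool(explicit.fullmatch(s)))
    for s in samples
}

for s, (b, e) in results.items():
    print(s, b, e)

# With search() instead of fullmatch(), the pattern stops at the first inner quote.
truncated = broken.search("'I don''t'").group()
print(truncated)
```

Only the plain string is accepted in full; the parenthesised one is rejected, and the escaped quote cuts the match short.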
It's usually easy to get something quick-and-dirty for parsing particular string literals that are giving you problems, but for a general solution you can get a very powerful and complete regex for string literals from the pyparsing module:
>>> import pyparsing
>>> pyparsing.quotedString.reString
'(?:"(?:[^"\\n\\r\\\\]|(?:"")|(?:\\\\x[0-9a-fA-F]+)|(?:\\\\.))*")|(?:\'(?:[^\'\\n\\r\\\\]|(?:\'\')|(?:\\\\x[0-9a-fA-F]+)|(?:\\\\.))*\')'
I'm not sure about significant differences between FORTRAN's string literals and Python's, but it's a handy reference if nothing else.
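import re
# Sample text: a FORTRAN literal with an escaped quote, followed by a stray quote.
ch = "'I don''t understand what you mean' and you' ?"
# Naive lazy match: stops at the first quote it finds.
print(re.search("'.*?'", ch).group())
# The closing quote must be neither preceded nor followed by another quote.
print(re.search("'.*?(?<!')'(?!')", ch).group())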
Result:
'I don'
'I don''t understand what you mean'
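One caveat worth checking before relying on the lookaround version: it appears to reject literals whose closing quote directly follows another quote, such as the empty string or a string ending in an escaped quote. The sketch below compares it against the alternation-based regex from the other answers; the test cases are invented for illustration.

```python
import re

lookaround = re.compile(r"'.*?(?<!')'(?!')")
alternation = re.compile(r"'(''|[^'])*'")

# An ordinary escaped string, the empty string, and a string ending in an escaped quote.
cases = ["'I don''t'", "''", "'don'''"]

# For each case: (lookaround matches the whole literal?, alternation matches it?)
comparison = {
    c: (bool(lookaround.fullmatch(c)), bool(alternation.fullmatch(c)))
    for c in cases
}

for c, (look, alt) in comparison.items():
    print(c, look, alt)
```

If such literals can occur in your input, the alternation form is probably the safer choice.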