Python 3 has a string method called str.isidentifier.
How can I get similar functionality in Python 2.6, short of rewriting my own regex, etc.?
The tokenize module defines a regexp called Name:
import re, tokenize, keyword
re.match(tokenize.Name + '$', somestr) and not keyword.iskeyword(somestr)
Invalid Identifier Validation
All of the answers in this thread seem to repeat the same validation mistake, which lets strings that are not valid identifiers match as if they were.
The regex patterns suggested in the other answers are built from tokenize.Name, which holds the regex pattern [a-zA-Z_]\w* (on Python 2.7.15), and the '$' regex anchor.
Please refer to the official Python 3 description of identifiers and keywords (which contains a paragraph that is relevant to Python 2 as well):
Within the ASCII range (U+0001..U+007F), the valid characters for identifiers are the same as in Python 2.x: the uppercase and lowercase letters A through Z, the underscore _ and, except for the first character, the digits 0 through 9.
Thus 'foo\n' should not be considered a valid identifier.
While one may argue that this code is functional:
>>> class Foo():
...     pass
...
>>> f = Foo()
>>> setattr(f, 'foo\n', 'bar')
>>> dir(f)
['__doc__', '__module__', 'foo\n']
>>> print getattr(f, 'foo\n')
bar
Although the newline character is indeed a valid ASCII character, it is not considered to be a letter. Furthermore, there is clearly no practical use for an identifier that ends with a newline character:
>>> f.foo\n
SyntaxError: unexpected character after line continuation character
The str.isidentifier function also confirms that this is an invalid identifier.
Python 3 interpreter:
>>> print('foo\n'.isidentifier())
False
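One caveat when relying on str.isidentifier in Python 3: it only checks the lexical form, so reserved words pass as well. A minimal sketch of combining it with keyword.iskeyword follows; the helper name is_usable_name is made up for illustration and is not part of any library.

```python
import keyword


def is_usable_name(s):
    """Hypothetical helper (Python 3 only): lexically valid and not reserved."""
    # str.isidentifier() accepts reserved words such as 'class',
    # so the keyword check has to be done separately.
    return s.isidentifier() and not keyword.iskeyword(s)


print('class'.isidentifier())    # True: keywords are lexically valid
print(is_usable_name('class'))   # False
print(is_usable_name('foo'))     # True
print(is_usable_name('foo\n'))   # False
```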
The $ anchor vs the \Z anchor
Quoting the official python2 Regular Expression syntax:
$
Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.
As a result, a string that ends with a newline is matched as a valid identifier:
>>> import tokenize
>>> import re
>>> m = re.match(tokenize.Name + '$', 'foo\n')
>>> m
<_sre.SRE_Match at 0x3eac8e0>
>>> print m.group()
foo
The regex pattern should not use the $ anchor; \Z is the anchor that should be used instead.
Quoting once again:
\Z
Matches only at the end of the string.
And now the regex is a valid one:
>>> re.match(tokenize.Name + r'\Z', 'foo\n') is None
True
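For a side-by-side view of the two anchors, here is a small sketch. It spells out the pattern [a-zA-Z_]\w* literally instead of reading tokenize.Name, on the assumption that tokenize.Name may not hold the same pattern on every Python version; the variable names are made up for illustration.

```python
import re

# Same pattern body, different end anchors.
with_dollar = re.compile(r'[a-zA-Z_]\w*$')
with_backslash_z = re.compile(r'[a-zA-Z_]\w*\Z')

for candidate in ['foo', 'foo\n', 'foo\n\n']:
    print(repr(candidate),
          with_dollar.match(candidate) is not None,
          with_backslash_z.match(candidate) is not None)
```

Only the single trailing newline slips through with $; with \Z both newline-terminated strings are rejected.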
Dangerous Implications
See Luke's answer for another example of how this kind of weak regex matching could have more dangerous implications in other circumstances.
Further Reading
Python 3 added support for non-ASCII identifiers; see PEP 3131.
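To illustrate the difference, here is a short sketch comparing an ASCII-only pattern with Python 3's str.isidentifier on a non-ASCII name. The name 'café' is an arbitrary example, and the hasattr guard is only there so the snippet also runs on Python 2, where the method does not exist.

```python
import re

# ASCII-only pattern, in the spirit of the Python 2 rules quoted above.
ascii_identifier = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')

name = u'caf\u00e9'  # 'café', an arbitrary example name

# The ASCII-only pattern rejects the name ...
print(ascii_identifier.match(name) is None)

# ... while str.isidentifier (Python 3 only) accepts it.
if hasattr(name, 'isidentifier'):
    print(name.isidentifier())
```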
re.match(r'[a-z_]\w*$', s, re.I)
should do nicely. As far as I know there isn't any built-in method.
Good answers so far. I'd write it like this.
import keyword
import re
def isidentifier(candidate):
    "Is the candidate string an identifier in Python 2.x"
    is_not_keyword = candidate not in keyword.kwlist
    pattern = re.compile(r'^[a-z_][a-z0-9_]*$', re.I)
    matches_pattern = bool(pattern.match(candidate))
    return is_not_keyword and matches_pattern
In Python < 3.0 this is quite easy, as you can't have Unicode characters in identifiers. This should do the job:
import re
import keyword
def isidentifier(s):
    if s in keyword.kwlist:
        return False
    return re.match(r'^[a-z_][a-z0-9_]*$', s, re.I) is not None
I've decided to take another crack at this, since there have been several good suggestions. I'll try to consolidate them. The following can be saved as a Python module and run directly from the command-line. If run, it tests the function, so is provably correct (at least to the extent that the documentation demonstrates the capability).
import keyword
import re
import tokenize
def isidentifier(candidate):
    """
    Is the candidate string an identifier in Python 2.x

    Return true if candidate is an identifier.
    Return false if candidate is a string, but not an identifier.
    Raises TypeError when candidate is not a string.

    >>> isidentifier('foo')
    True
    >>> isidentifier('print')
    False
    >>> isidentifier('Print')
    True
    >>> isidentifier(u'Unicode_type_ok')
    True

    # unicode symbols are not allowed, though.
    >>> isidentifier(u'Unicode_content_\u00a9')
    False
    >>> isidentifier('not')
    False
    >>> isidentifier('re')
    True
    >>> isidentifier(object)
    Traceback (most recent call last):
    ...
    TypeError: expected string or buffer
    """
    # test if candidate is a keyword
    is_not_keyword = candidate not in keyword.kwlist
    # create a pattern based on tokenize.Name
    pattern_text = '^{tokenize.Name}$'.format(**globals())
    # compile the pattern
    pattern = re.compile(pattern_text)
    # test whether the pattern matches
    matches_pattern = bool(pattern.match(candidate))
    # return true only if the candidate is not a keyword and the pattern matches
    return is_not_keyword and matches_pattern

def test():
    import unittest
    import doctest
    suite = unittest.TestSuite()
    suite.addTest(doctest.DocTestSuite())
    runner = unittest.TextTestRunner()
    runner.run(suite)

if __name__ == '__main__':
    test()
What I am using:
import keyword
import re
import tokenize

def is_valid_keyword_arg(k):
    """
    Return True if the string k can be used as the name of a valid
    Python keyword argument, otherwise return False.
    """
    # Don't allow python reserved words as arg names
    if k in keyword.kwlist:
        return False
    return re.match('^' + tokenize.Name + '$', k) is not None
All of the solutions proposed so far either fail to support Unicode or allow a digit as the first character when run on Python 3.
Edit: the proposed solutions should only be used on Python 2; on Python 3, isidentifier should be used. Here is a solution that should work anywhere:
re.match(r'^\w+$', name, re.UNICODE) and not name[0].isdigit()
Basically, it tests whether the string consists of one or more word characters (letters, digits, and underscores), and then it checks that the first character is not a digit.
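Another way to get something that works on both major versions is to detect str.isidentifier at run time and fall back to an ASCII-only regex otherwise. This is only a sketch under that assumption: the function name is_identifier and the module-level pattern are made up for illustration, and the keyword check uses whatever keyword list the running interpreter has.

```python
import keyword
import re

# ASCII-only fallback for interpreters without str.isidentifier (Python 2).
# \Z is used instead of $ so a trailing newline is not accepted.
_ASCII_IDENTIFIER = re.compile(r'[a-zA-Z_][a-zA-Z0-9_]*\Z')


def is_identifier(name):
    """Hypothetical helper: True if name is a non-keyword identifier."""
    if hasattr(name, 'isidentifier'):
        # Python 3: defer to the built-in, which also handles Unicode names.
        valid = name.isidentifier()
    else:
        # Python 2: identifiers are ASCII-only.
        valid = _ASCII_IDENTIFIER.match(name) is not None
    return valid and not keyword.iskeyword(name)


print(is_identifier('foo'))     # True
print(is_identifier('1foo'))    # False
print(is_identifier('foo\n'))   # False
print(is_identifier('for'))     # False
```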