I need to check that some text only contains lower-case letters a-z and a comma (",").
What is the best w开发者_如何学编程ay to do this in Python?
import re
def matches(s):
return re.match("^[a-z,]*$", s) is not None
Which gives you:
>>> matches("tea and cakes")
False
>>> matches("twiddledee,twiddledum")
True
You can optimise a bit with re.compile:
import re
matcher = re.compile("^[a-z,]*$")
def matches(s):
return matcher.match(s) is not None
import string
allowed = set(string.lowercase + ',')
if set(text) - allowed:
# you know it has forbidden characters
else:
# it doesn't have forbidden characters
Doing it with sets will be faster than doing it with for loops (especially if you want to check more than one text) and is all together cleaner than regexes for this situation.
an alternative that might be faster than two sets, is
allowed = string.lowercase + ','
if not all(letter in allowed for letter in text):
# you know it has forbidden characthers
here's some meaningless mtimeit
results. one
is the generator expression and two
is the set based solution.
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 3.98 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 4.39 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 3.51 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 7.7 usec per loop
You can see that the setbased one is significantly faster than the generator expression with a small expected alphabet and success conditions. the generator expression is faster with failures because it can bail. This is pretty much whats to be expected so it's interesting to see the numbers back it up.
another possibility that I forgot about is the hybrid approach.
not all(letter in allowed for letter in set(text))
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 5.06 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 6.71 usec per loop
it slows down the best case-ish but speeds up the worst case-ish. All in all, you'd have to test the different possibilities over a sample of your expected input. the broader the sample, the better.
import re
if not re.search('[^a-z\,]', yourString):
# True: contains only a-z and comma
# False: contains also something else
Not sure what do you mean with "contain", but this should go in your direction:
reobj = re.compile(r"[a-z,]+")
match = reobj.search(subject)
if match:
result = match.group()
else
result = ""
Just:
def alllower(s):
if ',' in s:
s=s.replace(',','a')
return s.isalpha() and s.islower()
with most efficient and simple.
or in one line:
lambda s:s.isalpha() or (',' in s and s.replace(',','a').isalpha()) and s.islower()
#!/usr/bin/env python
import string
text = 'aasdfadf$oih,234'
for letter in text:
if letter not in string.ascii_lowercase and letter != ',':
print letter
characters a -z are represented by bytes 97 - 122 and ord(char) returns the byte value of the character. Reading the file in binary and making the match should suffice.
f = open("myfile", "rb")
retVal = False
lowerAlphabets = range(97, 123)
try:
byte = f.read(1)
while byte != "":
# Do stuff with byte.
byte = f.read(1)
if byte:
if ord(byte) not in lowerAlphabets:
retVal = True
break
finally:
f.close()
if retVal:
print "characters not from a - z"
else:
print "characters from a - z"
精彩评论