Python text validation: a-z and comma (",")

Developer https://www.devze.com 2023-01-17 06:44 Source: web
I need to check that some text only contains lower-case letters a-z and a comma (",").

What is the best way to do this in Python?


import re
def matches(s):
    return re.match("^[a-z,]*$", s) is not None

Which gives you:

>>> matches("tea and cakes")
False
>>> matches("twiddledee,twiddledum")
True

You can optimise a bit with re.compile:

import re
matcher = re.compile("^[a-z,]*$")
def matches(s):
    return matcher.match(s) is not None
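
One caveat with the pattern above: in Python's re module, $ also matches just before a trailing newline, so "abc\n" is accepted. A minimal sketch of a stricter variant, assuming Python 3.4 or later where re.fullmatch is available (the name matches_strict is made up for illustration):

```python
import re

# Hypothetical stricter variant; the names here are illustrative only.
_STRICT = re.compile(r"[a-z,]*")

def matches_strict(s):
    # fullmatch requires the whole string to match, so no anchors are needed
    return _STRICT.fullmatch(s) is not None

print(matches_strict("twiddledee,twiddledum"))  # True
print(matches_strict("tea and cakes"))          # False
print(matches_strict("abc\n"))                  # False
```

Like the original pattern, this accepts the empty string; use + instead of * if that should be rejected.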


import string

text = "twiddledee,twiddledum"  # example input
allowed = set(string.ascii_lowercase + ',')
if set(text) - allowed:
    # text contains forbidden characters
    print("invalid")
else:
    # text contains only allowed characters
    print("valid")

Doing it with sets will be faster than doing it with for loops (especially if you want to check more than one text) and is altogether cleaner than regexes for this situation.
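
If several texts need checking, the allowed set can be built once and reused. A small sketch under that assumption; ALLOWED and only_allowed are illustrative names, not part of the answer above:

```python
import string

# Built once and reused for every check (illustrative names).
ALLOWED = frozenset(string.ascii_lowercase + ",")

def only_allowed(text):
    # True when every character of text is in ALLOWED
    return ALLOWED.issuperset(text)

for sample in ["twiddledee,twiddledum", "tea and cakes", "abc123"]:
    print(sample, only_allowed(sample))
```

Note that an empty text counts as valid here, just as it does with the set difference.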

An alternative that might be faster than building two sets is:

allowed = string.ascii_lowercase + ','
if not all(letter in allowed for letter in text):
    # text contains forbidden characters
    print("invalid")

Here are some meaningless timeit results. one is the generator expression and two is the set-based solution.

$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 3.98 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 4.39 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.two("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 3.51 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.one("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 7.7 usec per loop

You can see that the set-based one is significantly faster than the generator expression when the expected alphabet is small and the text is valid. The generator expression is faster on failures because it can bail out at the first forbidden character. This is pretty much what's to be expected, so it's interesting to see the numbers back it up.

Another possibility that I forgot about is the hybrid approach:

not all(letter in allowed for letter in set(text))

$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfasasdfadsfasdfasdfdaf")'
100000 loops, best of 3: 5.06 usec per loop
$ python -mtimeit -s'import scratch3' 'scratch3.three("asdfas2423452345sdfadf34")'
100000 loops, best of 3: 6.71 usec per loop

Compared with the plain generator expression, it slows down the best case-ish (an early failure) but speeds up the worst case-ish (all characters valid). All in all, you'd have to test the different possibilities over a sample of your expected input. The broader the sample, the better.
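
The scratch3 module itself is not shown above. Here is a guess at what it might have contained, with a way to time the three variants on your own input using the timeit module. The function bodies are reconstructed from the snippets in this answer and are an assumption, as is the choice to return True for valid text:

```python
import string
import timeit

allowed = string.ascii_lowercase + ","
allowed_set = set(allowed)

def one(text):
    # generator expression: stops at the first forbidden character
    return all(letter in allowed for letter in text)

def two(text):
    # set difference: always processes the whole text
    return not (set(text) - allowed_set)

def three(text):
    # hybrid: deduplicate first, then stop at the first forbidden character
    return all(letter in allowed for letter in set(text))

samples = ["asdfasasdfadsfasdfasdfdaf", "asdfas2423452345sdfadf34"]
for func in (one, two, three):
    for sample in samples:
        seconds = timeit.timeit(lambda: func(sample), number=10000)
        print(func.__name__, sample, round(seconds, 4))
```

Absolute numbers will differ by machine and Python version, so only the relative ordering on your own data is worth reading.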


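import re

yourString = "twiddledee,twiddledum"  # example input
if not re.search(r'[^a-z,]', yourString):
    # no character outside a-z and comma was found
    print("contains only a-z and comma")
else:
    print("contains something else as well")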


Not sure what you mean by "contain", but this should point you in the right direction:

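import re

subject = "tea,cakes and more"  # example input
reobj = re.compile(r"[a-z,]+")
match = reobj.search(subject)
if match:
    # first run of lower-case letters and commas
    result = match.group()
else:
    result = ""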


Just:

def alllower(s):
    if ',' in s:
        s=s.replace(',','a')
    return s.isalpha() and s.islower()

This is the most efficient and simple approach.

Or in one line:

lambda s: s.replace(',', 'a').isalpha() and s.replace(',', 'a').islower()
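
The string-method approach is not exactly equivalent to the character-class regex. A short sketch of where they differ, assuming Python 3, where the str methods are Unicode-aware; by_regex is an illustrative helper name:

```python
import re

def alllower(s):
    s = s.replace(",", "a")
    return s.isalpha() and s.islower()

def by_regex(s):
    # illustrative helper using the character class from the first answer
    return re.match("^[a-z,]*$", s) is not None

for sample in ["twiddledee,twiddledum", "", "caf\u00e9"]:
    print(repr(sample), alllower(sample), by_regex(sample))
```

The empty string (rejected by alllower, accepted by the regex) and non-ASCII lower-case letters such as é (accepted by alllower, rejected by the regex) are the two cases that come out differently.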


#!/usr/bin/env python

import string

text = 'aasdfadf$oih,234'

for letter in text:
    if letter not in string.ascii_lowercase and letter != ',':
        print(letter)
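
The same loop can be wrapped in a function that returns the offending characters instead of printing them. A sketch; forbidden_chars is an illustrative name:

```python
import string

def forbidden_chars(text):
    # collect every character that is neither a-z nor a comma, in order
    allowed = string.ascii_lowercase + ","
    return [letter for letter in text if letter not in allowed]

print(forbidden_chars("aasdfadf$oih,234"))  # ['$', '2', '3', '4']
```

An empty list means the text is valid.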


Characters a-z are represented by bytes 97-122 (the comma is byte 44), and ord(char) returns the byte value of a character. Reading the file in binary mode and checking each byte should suffice.

f = open("myfile", "rb")
hasForbidden = False
# byte values of a-z (97-122) plus the comma (44)
allowedBytes = set(range(97, 123)) | {ord(",")}
try:
    byte = f.read(1)
    while byte:  # an empty read means end of file
        if ord(byte) not in allowedBytes:
            hasForbidden = True
            break
        byte = f.read(1)
finally:
    f.close()

if hasForbidden:
    print("contains characters other than a-z and comma")
else:
    print("contains only a-z and comma")
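
Reading one byte at a time is slow for large files. Below is a sketch of the same check done in chunks, assuming Python 3 (where iterating over bytes yields integers) and the same rules as above, so a trailing newline counts as forbidden. file_is_clean and the chunk size are illustrative choices:

```python
ALLOWED_BYTES = frozenset(b"abcdefghijklmnopqrstuvwxyz,")

def file_is_clean(path, chunk_size=65536):
    # True when every byte in the file is a-z or a comma
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty read means end of file
                return True
            if not ALLOWED_BYTES.issuperset(chunk):
                return False
```

The with statement closes the file even when the function returns early.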