I'm开发者_如何学编程 reading in lots of user entered data that represent phone numbers from files. They are all slightly entered in differently:
5555555555 555-555-5555 555-555/5555 1555-555-5555 etc...
How could I easily parse in all of these phone numbers in Python and produce a canonical output like: 555-555-5555?
Dive into Python has a section on parsing phone numbers
http://www.diveintopython.org/regular_expressions/phone_numbers.html
I'm not american, but this works with russian phone numbers... maybe it applies to american ones too?
- Discard all non-number characters
- Validate amount of the numbers left
- Insert several dashes in appropriate places
take only the numbers with a regex. then find out if they appended the 1 (NO area code starts with 1). if it's there, remove it otherwise, format the 10 digits the way you want.
import re
pnumber = re.sub("[^0-9]", "", input_number)
if pnumber[0] == 1:
pnumber = pnumber[1:] #strip 1st char if 1
#insert the dashes
if len(pnumber) == 10:
pnumber = "%s-%s-%s" % (pnumber[:3],pnumber[3:6],pnumber[6:])
else:
#throw error
After a little preparation with string.maketrans
, strings' translate method affords very fast and simple operation. I'm giving Python 2 code for plain strings (Python 3, and Unicode strings in Python 2, are a bit different -- ask if that's what you need):
The preparation (do once and for all, e.g. at module load time):
>>> import string
>>> allchars = string.maketrans('', '')
>>> nondigits = allchars.translate(allchars, string.digits)
The execution (turn any suitable string into the property formatted number):
>>> x='1555-555-5555'
>>> y=(x.translate(allchars, nondigits)).lstrip('1')
>>> assert len(y) == 10
>>> '%s-%s-%s' % (y[:3], y[3:6], y[6:])
'555-555-5555
Of course, you'll need to decide what to do when len(y)
does not equal 10 (just raise an exception as I'm doing here, or, what else). But, this would be needed for any other form of processing (regex or whatever) just as well. The translate
approach is really really fast and simple!-)
def extractNumber(s):
"""take a string phone number and extract it to the legal string"""
target = ""
for char in s:
try:
target += int(s)
except ValueError:
target += '-'
return target
Decide which formats you want to recognize, then create a regular expression matching each one grouping the different parts of the number (like area code, prefix etc). Finally use a substitution to generate the canonical output you desire.
Example:
to match
xxx-xxx-xxxx -> \d{3}-\d{3}-\d{4}
(xxx) xxx-xxxx -> \(\d{3}\) \d{3}-\d{4}
1-xxx-xxx-xxx -> 1-\d{3}-\d{3}-\d{4}
This ignores rules that restrict prefix and area code (the US doesn't allow area codes or prefixes that being with 0 or 1). You could try and be super smart and create one regular expression that matches everything but you will end up with a jumbled mess that is impossible to modify instead you should OR the patterns together to make them easier to modify in the future.
basic idea:
pattern = re.compile(r'\d{3}-\d{3}-\d{4}|\(\d{3}\) \d{3}-\d{4}|1-\d{3}-\d{3}-\d{4}')
with grouping added for canonical output
pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})|\((\d{3})\) (\d{3})-(\d{4})|1-(\d{3})-(\d{3})-(\d{4})')
then just run that against your inputs and for each phone number input you will have 3 matching groups, one for the area code, one for the prefix, and one for the suffix which you can output however you want. You need a basic understanding of regular expressions, but it shouldn't be too hard.
精彩评论