I have a unicode string in Python and basically need to go through it character by character and replace certain ones based on a list of rules. One such rule is that a is changed to ö if a is after n. Also, if there are two vowel characters in a row, they get replaced by one vowel character and :. So if I have the string "natarook", what is the easiest and most efficient way of getting "nötaro:k"? Using Python 2.6 and CherryPy 3.1 if that matters.
Edit: two vowels in a row means the same vowel twice (oo, aa, ii).
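# -*- coding: utf-8 -*-
def subpairs(s, prefix, suffix):
    # prefix maps a pair to the replacement for its first character,
    # suffix maps a pair to the replacement for its second character
    def sub(i, sentinel=object()):
        r = prefix.get(s[i:i+2], sentinel)
        if r is not sentinel: return r
        r = suffix.get(s[i-1:i+1], sentinel)
        if r is not sentinel: return r
        return s[i]
    s = '\0'+s+'\0'  # pad both ends so every character has two neighbours
    return ''.join(sub(i) for i in xrange(1, len(s)-1))  # skip the padding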
vowels = [(v+v, u':') for v in 'aeiou']
prefix = {}
suffix = {'na':u'ö'}
suffix.update(vowels)
print subpairs('natarook', prefix, suffix)
# prints: nötaro:k
prefix = {'na':u'ö'}
suffix = dict(vowels)
print subpairs('natarook', prefix, suffix)
# prints: öataro:k
Focus on easy and correct first, then consider efficiency if profiling indicates it's a bottleneck.
The simple approach is:
prev = None
for ch in string:
    if ch == 'a':
        if prev == 'n':
            ...
    prev = ch
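Filled in for the two rules from the question, the loop above might look like the sketch below. The names `apply_rules` and `VOWELS` are made up for illustration. It assumes that "two vowels in a row" means the same vowel twice, and it lets `prev` track the last character written out rather than the last one read, so that a string such as "naa" comes out the same as it would if the n/a rule were applied before the doubled-vowel rule.

```python
# -*- coding: utf-8 -*-
VOWELS = u'aeiou'  # assumption: only these characters count as vowels


def apply_rules(text):
    """Hypothetical helper: apply both rules in one left-to-right pass."""
    out = []
    prev = None  # last character appended to `out`
    for ch in text:
        if ch == u'a' and prev == u'n':
            new = u'ö'   # 'a' directly after 'n' becomes 'ö'
        elif ch in VOWELS and ch == prev:
            new = u':'   # the repeat of a doubled vowel becomes ':'
        else:
            new = ch     # everything else is copied unchanged
        out.append(new)
        prev = new
    return u''.join(out)


print(apply_rules(u'natarook'))
```

Each additional rule becomes another `elif` branch, which stays readable for a handful of rules but gets unwieldy for a long list.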
"I know, I'll use regular expressions!"
But seriously, regexes are really good for string manipulation.
You could write one per rule, like so:
s/na/nö/g
s/([aeiou])\1/$1:/g
Or you could generate them at runtime from some other source which lists them all.
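As a rough sketch of generating the patterns at runtime, the code below builds compiled regexes from a plain data table. The table layout (`LITERAL_RULES`, `VOWELS`) and the function names `build_rules` and `apply_rules` are assumptions for illustration, not anything from the question; the real source could just as well be a file or a database.

```python
# -*- coding: utf-8 -*-
import re

# Hypothetical rule source: plain text pairs with no regex syntax in them.
LITERAL_RULES = [(u'na', u'nö')]
VOWELS = u'aeiou'  # assumption: a doubled vowel is the same vowel twice


def build_rules(literal_rules, vowels):
    """Turn the plain data above into (compiled pattern, replacement) pairs."""
    rules = []
    for old, new in literal_rules:
        # re.escape keeps special characters in `old` literal; a function
        # replacement keeps backslashes in `new` from being interpreted.
        rules.append((re.compile(re.escape(old)), lambda m, new=new: new))
    doubled = re.compile(u'([' + re.escape(vowels) + u'])\\1')
    rules.append((doubled, u'\\1:'))
    return rules


def apply_rules(text, rules):
    """Apply every rule in order, one full pass over the text per rule."""
    for pattern, repl in rules:
        text = pattern.sub(repl, text)
    return text


RULES = build_rules(LITERAL_RULES, VOWELS)
print(apply_rules(u'natarook', RULES))
```

Keeping the rules in a list rather than a dictionary fixes the order in which they are applied, which matters as soon as one rule's output can match another rule's pattern.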
Given your rules, I'd say you really want a simple state machine. Hmm, on second thought, maybe not; you can just look back in the string as you go.
# -*- coding: utf-8 -*-
vowel_set = frozenset([u'a', u'e', u'i', u'o', u'u', u'ö'])

def fix_the_string(s):
    lst = []
    for ch in s:
        if ch == u'a' and lst and lst[-1] == u'n':
            lst.append(u'ö')
        elif ch in vowel_set and lst and lst[-1] == ch:
            # doubled vowel: keep the first one and mark the repeat with ':'
            lst.append(u':')
        else:
            lst.append(ch)
    return u"".join(lst)

print fix_the_string(u"natarook")  # prints: nötaro:k
EDIT: Now that I have seen the answer by @Anon, I think that's the simplest approach. The loop above might actually be faster once you get a whole bunch of rules in play, as it makes one pass over the string; but maybe not, because the regexp code in Python is fast C code.
But simpler is better. Here is actual Python code for the regexp approach:
# -*- coding: utf-8 -*-
import re

pat_na = re.compile(r'na')
pat_double_vowel = re.compile(r'([aeiou])\1')  # the same vowel twice

def fix_the_string(s):
    s = re.sub(pat_na, u'nö', s)
    s = re.sub(pat_double_vowel, r'\1:', s)
    return s

print fix_the_string(u"natarook") # prints "nötaro:k"
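Whether the single-pass loop or the regexp version is faster is something to measure rather than guess. The following is only a sketch of how that could be done with `timeit`; the function names `fix_regex` and `fix_loop` and the repeated sample input are made up for the comparison, and the numbers will vary by machine and Python version.

```python
# -*- coding: utf-8 -*-
import re
import timeit

PAT_NA = re.compile(u'na')
PAT_DOUBLE = re.compile(u'([aeiou])\\1')


def fix_regex(s):
    """Two regex passes: the n/a rule first, then doubled vowels."""
    s = PAT_NA.sub(u'nö', s)
    return PAT_DOUBLE.sub(u'\\1:', s)


def fix_loop(s):
    """One pass in pure Python, looking back at the last output character."""
    out = []
    for ch in s:
        if ch == u'a' and out and out[-1] == u'n':
            out.append(u'ö')
        elif ch in u'aeiou' and out and out[-1] == ch:
            out.append(u':')
        else:
            out.append(ch)
    return u''.join(out)


sample = u'natarook' * 1000  # made-up input, only for timing
assert fix_regex(sample) == fix_loop(sample)

for fn in (fix_regex, fix_loop):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print('%s: %.4f s' % (fn.__name__, seconds))
```

With only two rules this says little about how either approach scales; rerunning it with the real rule list and realistic strings would give a more useful answer.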
It might be simpler to use a handmade list of regular expressions rather than programmatically generating them. I recommend the following code.
# -*- coding: utf-8 -*-
import re
# regsubs is a dictionary of regular expressions as keys,
# and the replacement regexps as values
regsubs = {'na':u'nö',
           '([aeiou])\\1': '\\1:'}
def makesubs(s):
    for pattern, repl in regsubs.iteritems():
        s = re.sub(pattern, repl, s)
    return s
print makesubs('natarook')
# prints: nötaro:k