How can I make this Python 2.6 function work with Unicode?

I've got this function, which I modified from material in chapter 1 of the online NLTK book. It's been very useful to me but, despite reading the chapter on Unicode, I feel just as lost as before.

import nltk

def openbookreturnvocab(book):
    fileopen = open(book)                        # read() returns a byte str
    rawness = fileopen.read()
    tokens = nltk.wordpunct_tokenize(rawness)    # tokenize the raw bytes
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]    # lowercase every token
    nltkvocab = sorted(set(nltkwords))           # unique, sorted vocabulary
    return nltkvocab

When I tried it the other day on Also Sprach Zarathustra, it clobbered words with an umlaut over the o's and u's. I'm sure some of you will know why that happened. I'm also sure that it's quite easy to fix. I know that it just has to do with calling a function that decodes the tokens into unicode strings. If so, it seems to me that it might not belong inside that function definition at all, but here, where I prepare to write to file (a sketch of the usual decode/encode pattern follows this function):

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)      # readmethod is a mode, e.g. 'w'
    jottedf = '\n'.join(jotted)                  # one vocabulary entry per line
    filemydata.write(jottedf)
    filemydata.close()
    return 0
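
For reference, the usual Python 2 idiom is to decode bytes to unicode as soon as they are read and to encode back to bytes only when writing out. A minimal sketch of that round trip, assuming the files are UTF-8; the helper names read_text and write_lines are just for illustration (codecs.open is in the standard library and does the conversion at the file boundary):

import codecs

def read_text(path):
    # codecs.open decodes for you: read() returns a unicode object
    infile = codecs.open(path, 'r', encoding='utf-8')
    text = infile.read()
    infile.close()
    return text

def write_lines(lines, path):
    # the wrapped file encodes unicode back to UTF-8 bytes on write
    outfile = codecs.open(path, 'w', encoding='utf-8')
    outfile.write(u'\n'.join(lines))
    outfile.close()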

I heard that what I had to do was decode the string into unicode after reading it from the file. I tried amending the function like so:

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    unirawness = rawness.decode('utf-8')         # new: decode the bytes to unicode
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltktext = nltk.Text(tokens)
    nltkwords = [w.lower() for w in nltktext]
    nltkvocab = sorted(set(nltkwords))
    return nltkvocab

But that brought this error when I used it on Hungarian; when I used it on German, I had no errors.

>>> import bookroutines
>>> elles1 = bookroutines.openbookreturnvocab("lk1-les1")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 9, in openbookreturnvocab
    nltktext = nltk.Text(tokens)
  File "/usr/lib/pymodules/python2.6/nltk/text.py", line 285, in __init__
    self.name = " ".join(map(str, tokens[:8])) + "..."
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 4: ordinal not in range(128)
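
The traceback itself shows the cause: nltk.Text builds a display name with " ".join(map(str, tokens[:8])), and in Python 2 calling str() on a unicode object implicitly encodes it with the ASCII codec, so the first accented character fails. The same error reproduces in two lines at the interpreter:

>>> str(u'\xe1')                 # á has no ASCII representation
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)
>>> u'\xe1'.encode('utf-8')      # an explicit encoding works
'\xc3\xa1'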

I fixed the function that files the data like so:

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = u'\n'.join(jotted)                 # new: join with a unicode separator
    filemydata.write(jottedf)
    filemydata.close()
    return 0

However, that brought this error when I tried to write the German to file:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "bookroutines.py", line 23, in jotindex
    filemydata.write(jottedf)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 414: ordinal not in range(128)
>>> 

...which is what you get when you try to write the u'\n'-joined data:

>>> jottedf = u'\n'.join(elles1)
>>> filemydata.write(jottedf)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 504: ordinal not in range(128)
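
That is the same implicit conversion in the other direction: a file opened with plain open() expects byte strings, so writing a unicode object makes Python 2 encode it with the default ASCII codec first. A minimal reproduction (the path is just an example):

>>> f = open('/tmp/demo.txt', 'w')
>>> f.write(u'\xf6')             # ö: implicitly encoded with the ASCII codec
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 0: ordinal not in range(128)
>>> f.write(u'\xf6'.encode('utf-8'))   # encoding explicitly writes the two bytes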


Each string that you read from your file can be converted to unicode by calling rawness.decode('utf-8'), if you have the text in UTF-8. You will end up with unicode objects. Also, I don't know what "jotted" is, but you may want to make sure it's a list of unicode objects and use u'\n'.join(jotted) instead.
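
To see what that conversion does, here is the round trip for a single character (ö arrives from a UTF-8 file as the two bytes \xc3\xb6):

>>> rawness = '\xc3\xb6'          # the UTF-8 bytes for ö, as read from disk
>>> unirawness = rawness.decode('utf-8')
>>> unirawness
u'\xf6'
>>> type(unirawness)
<type 'unicode'>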

Update:

It appears that the NLTK library doesn't like unicode objects. Fine, then you have to make sure that you are using str instances with UTF-8 encoded text. Try using this:

tokens = nltk.wordpunct_tokenize(unirawness)
nltktext = nltk.Text([token.encode('utf-8') for token in tokens])

and this:

jottedf = u'\n'.join(jotted)
filemydata.write(jottedf.encode('utf-8'))

but if jotted is really a list of UTF-8-encoded str objects, then you don't need the encoding step and this should be enough:

jottedf = '\n'.join(jotted)
filemydata.write(jottedf)
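
Putting the pieces together, here is a sketch of both functions under the assumptions above: the input file is UTF-8, and nltk.Text is given UTF-8 byte strings. Lowercasing is done while the tokens are still unicode, since str.lower() in Python 2 leaves the UTF-8 bytes for characters like Ö untouched:

import nltk

def openbookreturnvocab(book):
    fileopen = open(book)
    rawness = fileopen.read()
    fileopen.close()
    unirawness = rawness.decode('utf-8')            # bytes -> unicode
    tokens = nltk.wordpunct_tokenize(unirawness)
    nltkwords = [w.lower() for w in tokens]         # unicode-aware lowercasing
    nltktext = nltk.Text([w.encode('utf-8') for w in nltkwords])
    nltkvocab = sorted(set(nltktext))               # unique, sorted vocabulary
    return nltkvocab

def jotindex(jotted, filename, readmethod):
    filemydata = open(filename, readmethod)
    jottedf = '\n'.join(jotted)    # plain join: the tokens are byte strings now
    filemydata.write(jottedf)
    filemydata.close()
    return 0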

By the way, it looks as though NLTK isn't very cautious with respect to unicode and encoding (at least in the demos). Better be careful and check that it has processed your tokens correctly. Also, check your encodings: a mismatch there may be why you get errors with the Hungarian text and not the German.
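
A quick way to test an encoding guess is simply to try the decode and see whether it raises; a sketch with a hypothetical sniff_encoding helper (Hungarian text is often ISO-8859-2 rather than UTF-8):

def sniff_encoding(rawness, candidates=('utf-8', 'iso-8859-2', 'latin-1')):
    # UTF-8 is strict, so a wrong guess usually raises UnicodeDecodeError;
    # latin-1 accepts any byte sequence, so it acts as a last resort
    for encoding in candidates:
        try:
            rawness.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None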
