How to convert a UTF string with Scandinavian characters to ASCII?

Developer https://www.devze.com 2022-12-24 08:34 Source: Internet
I would like to convert this string

foo_utf = u'nästy chäräctörs with å and co.' # unicode

into this

foo_ascii = 'nästy chäräctörs with å and co.' # ASCII


Any idea how to do this in Python (2.6)? I found the unicodedata module, but I have no idea how to do the transformation.


I don't think you can. Those "nästy chäräctörs" can't be encoded as ASCII, so you'll have to pick a different encoding (UTF-8 or Latin-1 or Windows-1252 or something).
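To see this concretely, here is a minimal sketch. The helper name can_encode is hypothetical, and the string is written with \x escapes so the example does not depend on the source file's encoding:

```python
# The question's string; \xe4 = ä, \xf6 = ö, \xe5 = å.
text = u'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'


def can_encode(s, encoding):
    """Return True if every character of s is representable in encoding."""
    try:
        s.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


ascii_ok = can_encode(text, 'ascii')      # ä, ö and å fall outside ASCII
latin1_ok = can_encode(text, 'latin-1')   # Latin-1 covers all three
utf8_ok = can_encode(text, 'utf-8')       # UTF-8 covers all of Unicode
```

Only the ASCII attempt fails, because ASCII stops at code point 127.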


Try the encode method of the unicode string:

>>> u'nästy chäräctörs with å and co.'.encode('latin-1')
'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'
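The result of encode is a byte string, and decoding it with the same codec gives the original text back. A small sketch of that round trip, with illustrative variable names:

```python
# The question's string; \xe4 = ä, \xf6 = ö, \xe5 = å.
text = u'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'

latin1_bytes = text.encode('latin-1')  # one byte per character for this text
utf8_bytes = text.encode('utf-8')      # two bytes for each of ä, ö and å

# Decoding with the codec used for encoding restores the unicode string.
round_trip = latin1_bytes.decode('latin-1')
```

The Latin-1 result is as long as the text itself, while the UTF-8 result is longer because each accented letter takes two bytes.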


There are several options in the codecs module in Python's stdlib, depending on how you want the extended characters handled:

>>> import codecs
>>> u = u'nästy chäräctörs with å and co.'
>>> encode = codecs.get_encoder('ascii')
>>> encode(u)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 1: ordinal not in range(128)
>>> encode(u, 'ignore')
('nsty chrctrs with  and co.', 31)
>>> encode(u, 'replace')
('n?sty ch?r?ct?rs with ? and co.', 31)
>>> encode(u, 'xmlcharrefreplace')
('n&#228;sty ch&#228;r&#228;ct&#246;rs with &#229; and co.', 31)
>>> encode(u, 'backslashreplace')
('n\\xe4sty ch\\xe4r\\xe4ct\\xf6rs with \\xe5 and co.', 31)

Hopefully one of those will meet your needs. There's more information available in the Python codecs module documentation.
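The same error handlers can also be passed straight to the string's own encode method, which returns only the encoded bytes rather than a (bytes, length) tuple. A short sketch that collects the result for each handler (the results dictionary is just for illustration):

```python
# The question's string; \xe4 = ä, \xf6 = ö, \xe5 = å.
text = u'n\xe4sty ch\xe4r\xe4ct\xf6rs with \xe5 and co.'

handlers = ('ignore', 'replace', 'xmlcharrefreplace', 'backslashreplace')

# Map each error handler name to the ASCII bytes it produces.
results = {}
for handler in handlers:
    results[handler] = text.encode('ascii', handler)
```

Which handler fits depends on where the output goes: xmlcharrefreplace suits HTML or XML, backslashreplace is handy for logs, and ignore or replace lose information.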


This really is a Django question, and not a Python one. If the string is in one of your .py files, make sure that you have the following line at the top of your file: # -*- coding: utf-8 -*-

Furthermore, your string needs to be of type unicode (u'foobar').

And then make sure that your HTML page declares a Unicode encoding:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

That should do the whole trick. No encoding/decoding etc. is necessary; just make sure that everything is Unicode, and you are on the safe side.


You can also use the unicodedata module (http://docs.python.org/library/unicodedata.html) provided with Python to convert a lot of Unicode characters into an ASCII variant, e.g. to fix the different " characters and such. Follow that up with the encode() method and you can completely clean up a string.

The function you mainly want from unicodedata is normalize; pass it the NFKC form (or NFKD, which decomposes accented letters so that a following encode('ascii', 'ignore') keeps the base letter).
