I'm writing some code to parse RTF documents, and need to handle the various codepages they can use. Python comes with decoders for all the necessary Windows codepages, but I'm not sure how to handle the Mac ones:
# 77: "10000", # Mac Roman
# 78: "10001", # Mac Shift Jis
# 79: "10003", # Mac Hangul
# 80: "10008", # Mac GB2312
# 81: "10002", # Mac Big5
# 83: "10005", # Mac Hebrew
# 84: "10004", # Mac Arabic
# 85: "10006", # Mac Greek
# 86: "10081", # Mac Turkish
# 87: "10021", # Mac Thai
# 88: "10029", # Mac East Europe
# 89: "10007", # Mac Russian
Does Python have any built-in support for these? If not, is there a cross-platform pure-Python library that will handle them?
You can use the Python codecs for these, which are known by names such as 'mac-roman' and 'mac-turkish'.
>>> 'foo'.decode('mac-turkish')
u'foo'
You'll have to refer to them by name; the numbers in your question don't appear in the source files. For more information, look at $pylib/encodings/mac_*.py.
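As a rough sketch of how the name-based lookup could be wired to the code page numbers from the question: the pairing of numbers to codec names below is an assumption (it is not taken from the Python sources), the table covers only a few of the listed pages, and `CANDIDATES` and `find_codec` are hypothetical names. `codecs.lookup` raises `LookupError` for a name the running interpreter does not know, so the helper probes rather than assumes the codec is present.

```python
import codecs

# Assumed pairing of RTF code page numbers to Python codec names.
# Only a subset of the pages in the question is listed here.
CANDIDATES = {
    "10000": "mac_roman",
    "10006": "mac_greek",
    "10007": "mac_cyrillic",
    "10081": "mac_turkish",
}


def find_codec(codepage):
    """Return the canonical codec name for a code page, or None if unavailable."""
    name = CANDIDATES.get(codepage)
    if name is None:
        return None  # no candidate listed for this code page
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None  # this interpreter has no codec by that name
```

A caller would then pass the returned name to `bytes.decode` (or `str.decode` on Python 2), and fall back to some other strategy when `find_codec` returns `None`.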
It seems that at least the Mac Roman and Mac Turkish encodings exist in the Python standard library, under the names macroman and macturkish. See http://svn.python.org/projects/python/trunk/Lib/encodings/aliases.py for a complete list of encoding aliases in the most up-to-date Python.
No.
However, unicode.org provides codec description files that you can use to generate modules that will parse those codecs. Included with Python source distributions is a script that will convert these files: Python-x.x/Tools/unicode/gencodec.py.
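To illustrate what such a description file contains, here is a minimal sketch that reads the common single-byte layout (a hex byte value, a hex Unicode code point, then a `#` comment) into a decoding table. This is not what gencodec.py does internally; `parse_mapping`, `decode_bytes` and the two-line `SAMPLE` are hypothetical, the code assumes Python 3, and multi-byte encodings such as Shift JIS would need more than this.

```python
def parse_mapping(text):
    """Build a {byte value: character} table from mapping-file text."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # byte listed without a Unicode assignment
        table[int(fields[0], 16)] = chr(int(fields[1], 16))
    return table


def decode_bytes(data, table):
    """Decode single-byte data; unmapped bytes become U+FFFD."""
    return "".join(table.get(b, "\ufffd") for b in bytearray(data))


# Hypothetical two-entry excerpt in the same layout as a real mapping file.
SAMPLE = (
    "# byte\tunicode\tname\n"
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n"
    "0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS\n"
)
```

For real use, generating a proper codec module with the bundled script is likely the better route, since the result plugs into `decode` and `encode` like any other codec.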