Decoding Mac OS text in Python

Developer https://www.devze.com 2022-12-08 18:39 Source: web



I'm writing some code to parse RTF documents, and need to handle the various codepages they can use. Python comes with decoders for all the necessary Windows codepages, but I'm not sure how to handle the Mac ones:

# 77: "10000", # Mac Roman
# 78: "10001", # Mac Shift Jis
# 79: "10003", # Mac Hangul
# 80: "10008", # Mac GB2312
# 81: "10002", # Mac Big5
# 83: "10005", # Mac Hebrew
# 84: "10004", # Mac Arabic
# 85: "10006", # Mac Greek
# 86: "10081", # Mac Turkish
# 87: "10021", # Mac Thai
# 88: "10029", # Mac East Europe
# 89: "10007", # Mac Russian

Does Python have any built-in support for these? If not, is there a cross-platform pure-Python library that will handle them?

Python ships codecs for some of these. They are registered under names such as 'mac-roman' and 'mac-turkish', not under the code page numbers.

>>> b'foo'.decode('mac-turkish')  # Python 3; in Python 2 the result prints as u'foo'
'foo'

You'll have to refer to them by name, because the numbers in your question don't appear in the source files. For more information, look at $pylib/encodings/mac_*.py.
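Since the codecs are looked up by name, the parser needs its own table from code page number to codec name. Here is a minimal sketch of that idea. The table `MAC_CODEPAGE_TO_CODEC` and both helper functions are hypothetical names made up for illustration, and the table only lists the single-byte encodings I'm confident CPython ships; the remaining code pages from the question are deliberately left out rather than guessed at.

```python
import codecs

# Assumed mapping from the code page numbers in the question to Python
# codec names. This is an illustration, not an exhaustive or official table.
MAC_CODEPAGE_TO_CODEC = {
    "10000": "mac_roman",     # Mac Roman
    "10006": "mac_greek",     # Mac Greek
    "10007": "mac_cyrillic",  # Mac Russian
    "10081": "mac_turkish",   # Mac Turkish
}


def codec_for_codepage(codepage):
    """Return a usable codec name for the code page, or None if unknown."""
    name = MAC_CODEPAGE_TO_CODEC.get(codepage)
    if name is None:
        return None
    try:
        # codecs.lookup raises LookupError if this interpreter lacks the codec.
        codecs.lookup(name)
    except LookupError:
        return None
    return name


def decode_mac_bytes(data, codepage, errors="strict"):
    """Decode bytes using the codec mapped to the given code page number."""
    name = codec_for_codepage(codepage)
    if name is None:
        raise LookupError("no known Python codec for code page %s" % codepage)
    return data.decode(name, errors)
```

Checking with `codecs.lookup` first means a missing codec shows up as a clear "unknown code page" case instead of failing somewhere in the middle of parsing.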

It seems that at least the Mac Roman and Mac Turkish encodings exist in the Python stdlib, under the names macroman and macturkish. See http://svn.python.org/projects/python/trunk/Lib/encodings/aliases.py for the complete list of encoding aliases in that version of Python.
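Rather than reading aliases.py by hand, you can ask the interpreter you're actually running. A small sketch, assuming the `aliases` dictionary in `encodings.aliases` keeps its alias-to-module-name layout (the variable `mac_aliases` is just a name chosen here):

```python
from encodings.aliases import aliases

# Keep only the aliases whose target codec module starts with "mac_".
mac_aliases = {
    alias: target
    for alias, target in aliases.items()
    if target.startswith("mac_")
}

for alias in sorted(mac_aliases):
    print(alias, "->", mac_aliases[alias])
```

The exact list depends on the Python version, so it's worth running this on the version you plan to deploy.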

No.

However, unicode.org provides codec description files that you can use to generate modules that will decode those encodings. Python source distributions include a script that converts these files: Python-x.x/Tools/unicode/gencodec.py.
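To give a feel for what such a generated module does, here is a rough sketch that builds a decoding table from mapping-file text and decodes with it. This is not gencodec.py itself. It assumes simple lines of the form "byte, code point, comment" for a single-byte encoding, `SAMPLE_MAPPING` is a made-up two-line excerpt, and the function names are hypothetical. Mapping files with multi-byte codes or composite entries would need more work.

```python
import codecs

# Hypothetical excerpt in the "0xXX<TAB>0xXXXX<TAB># comment" layout.
SAMPLE_MAPPING = (
    "# sample mapping, for illustration only\n"
    "0x41\t0x0041\t# LATIN CAPITAL LETTER A\n"
    "0x80\t0x00C4\t# LATIN CAPITAL LETTER A WITH DIAERESIS\n"
)


def build_decoding_table(mapping_text):
    """Build a 256-character decoding table from mapping-file text."""
    # U+FFFE marks "undefined" for charmap decoding.
    table = ["\ufffe"] * 256
    for line in mapping_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        byte_value = int(fields[0], 16)
        code_point = int(fields[1], 16)
        table[byte_value] = chr(code_point)
    return "".join(table)


def decode_with_table(data, table, errors="strict"):
    """Decode bytes through the table using the stdlib charmap decoder."""
    text, _consumed = codecs.charmap_decode(data, errors, table)
    return text


table = build_decoding_table(SAMPLE_MAPPING)
print(decode_with_table(b"A\x80", table))
```

Bytes that the mapping text doesn't define stay marked as undefined, so decoding them in strict mode raises `UnicodeDecodeError` instead of silently producing wrong characters.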