
Python string conversion (localization) question

Developer https://www.devze.com 2023-03-20 23:44 Source: the web
source = '\xe3\xc7\x9f'
destination = u'\u0645\u0627\u06ba'

How do I get from the source, to the destination?

(The source and the destination are both the same 3 characters, in the same order, just represented differently.)

Technically, the source is in Urdu and the destination is the Unicode code points for the same 3 characters. See: https://www.codeaurora.org/git/projects/froyo-gb-dsds-7227/repository/revisions/39141d7a9dbdd2e9acf006430a7e7557ffd1efce/entry/external/icu4c/data/mappings/ibm-5352_P100-1998.ucm

If I do:

source.decode('cp1006')

I get:

u'\ufed9\ufb84\x9f'

Which is not what I'm looking for...

If I do:

source.decode('raw_unicode_escape')

I get:

u'\xe3\xc7\x9f'

Which is also not what I'm looking for...

How do I get from point A (source) to point B (destination) in Python?


In [129]: source = '\xe3\xc7\x9f'
In [130]: source.decode('cp1256')
Out[130]: u'\u0645\u0627\u06ba'

In [131]: destination
Out[131]: u'\u0645\u0627\u06ba'
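The session above is Python 2, where a plain '...' literal is already a byte string. As a minimal sketch for readers on Python 3 (an assumption about the reader's environment, not part of the original answer), the same decode needs an explicit bytes literal:

```python
import unicodedata

# Python 3: the raw bytes must be written as a bytes literal (b'...').
source = b'\xe3\xc7\x9f'

# cp1256 is the Windows Arabic code page named in the answer above.
decoded = source.decode('cp1256')

# Show each resulting code point with its official Unicode name.
for ch in decoded:
    print('U+{:04X} {}'.format(ord(ch), unicodedata.name(ch)))
```

Comparing decoded against the destination string from the question is then a plain equality check between two str objects.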

PS. The question "What codec transforms this str object into that unicode object?" comes up from time to time on SO. Here's a little script which can help answer these questions quickly (it simply tries to decode the str object with every possible encoding):

guess_encoding.py:

import binascii
import zlib
import codecs
import pkgutil
import os
import encodings

def all_encodings():
    # Collect every codec module shipped in the encodings package, plus the
    # canonical names that the alias table points at.
    modnames = set(modname for importer, modname, ispkg in pkgutil.walk_packages(
        path=[os.path.dirname(encodings.__file__)], prefix=''))
    aliases = set(encodings.aliases.aliases.values())
    return modnames.union(aliases)

def main():
    codec_names = all_encodings()
    while True:
        try:
            text = raw_input()
        except EOFError:
            # Stop cleanly at end of input (Ctrl-D or a closed pipe).
            break
        # Turn the typed escapes (e.g. \xe3) into the actual bytes.
        text = codecs.escape_decode(text)[0]
        for enc in codec_names:
            try:
                msg = text.decode(enc)
            except (IOError, UnicodeDecodeError, LookupError,
                    TypeError, ValueError, binascii.Error, zlib.error):
                # This codec cannot decode the input; try the next one.
                continue
            if msg:
                print('Decoding with {enc}:'.format(enc=enc))
                print(msg)

if __name__ == '__main__':
    main()

After running guess_encoding.py you type in the repr of the str object:

% guess_encoding.py
\xe3\xc7\x9f

It spits out the associated unicode object with respect to every possible Python encoding.

Since you told us the desired unicode object was

In [128]: print(destination)
ماں

you can quickly search the output for ماں and find the successful codec:

Decoding with cp1256:
ماں
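When the desired unicode object is already known, the search through the output can itself be automated. The following is a rough Python 3 sketch of that idea, not the script above: the function name candidate_codecs is hypothetical, and it only tries the canonical names listed in encodings.aliases, so a codec with no alias entry would be missed.

```python
import encodings.aliases

def candidate_codecs(raw, expected):
    """Return the codec names that decode the bytes `raw` to exactly `expected`."""
    # Canonical codec names that the standard alias table points at.
    names = sorted(set(encodings.aliases.aliases.values()))
    matches = []
    for name in names:
        try:
            decoded = raw.decode(name)
        except Exception:
            # Unknown, non-text, or failing codec: skip it and keep probing.
            continue
        if decoded == expected:
            matches.append(name)
    return matches

if __name__ == '__main__':
    print(candidate_codecs(b'\xe3\xc7\x9f', '\u0645\u0627\u06ba'))
```

Catching Exception broadly is a deliberate choice here, since the loop is only probing and any failure simply means "not this codec".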