Python decoding works for me but not others

I'm sure this question has been answered somewhere, but I have no idea what to search for. My problem is not so much my problem as everyone else's. Long story short, I have a Python script with text decoding, and it decodes fine for me but fails for other users, even with the same code and input.

I've written a script (source on Bitbucket) that converts Windows Mobile 6 SMSes (via PIM Backup output) to Android SMSes (imported via SMS Backup & Restore) by converting the PIM Backup content to an SMSB&R-compatible XML format.

Now, PIM Backup outputs its content in UCS-2 Little Endian format, which is nice since it supports all kinds of international conversations. In my script, I load the content using Python's built-in string decoding and create a csv reader object with:

# Read the file contents
sms_text = csv_file.read().decode('utf-16').split(os.linesep)
sms_reader = csv.reader(sms_text, delimiter=';', quotechar='"', escapechar='\\')

Then I process each line of the csv reader with:

row = sms_reader.next()

I have this in a try block because very occasionally it throws a UnicodeEncodeError when something's not quite right. But again, this is very infrequent for me.

My problem is that this seems to get thrown pretty much all the time for other users of my script using non-ASCII characters in their SMSes. A German user contacted me just recently saying only about 10% of his SMSes decoded correctly. He sent me his .pib file, I ran it through my script, and didn't have a single problem in the conversion. All the output seemed to be standard ANSI/ISO 8859-1/Windows-1252/whatever, so hardly exotic.

My question is: why might it be that these users are failing to decode their inputs when I have no problems, using exactly the same code (and version of Python)? And as a follow-up, what can I do to amend my script to make it work for everyone?

EDIT: One important point I failed to mention is that I'm running the script in Eclipse using PyDev. When I run it in the command prompt, it throws all the same problems as it does for everyone else! I still don't know what the problem is, but hopefully that helps narrow it down.

An example of a very simple .csm file (extracted from the .pib file, names and numbers changed) with non-standard characters would be the following:

Msg Id;Sender Name;Sender Address;Sender AddressType;Prefix;Subject;Body;BodyType;Folder;Account;Msg Class;Content Length;Msg Size;Msg Flags;Msg Status;Modify Time;Delivery Time;Recipient Nbr;Recipients;Attachment Nbr;Attachments
0x00,0x00;"491703000000";"491703000000";;"";"Wir wünschen dem rainer alles gute und viel gesundheit! Bis nächste woche, wir hören uns bis dahin noch mal.. Liebe grüße aus md!";"";0;"\\%MDF3";"SMS";"IPM.SMStext";;;33;262144;2007,09,23,19,44,32;2007,09,23,19,44,31;1;"851980\;Gela\;+491739000000\;1\;0\;SMS";0;""

It's non-trivial to catch exactly what the problem is just by working with that string however, since I don't experience the exception myself.

Another example in which I do have problems (even in Eclipse) is the following:

Msg Id;Sender Name;Sender Address;Sender AddressType;Prefix;Subject;Body;BodyType;Folder;Account;Msg Class;Content Length;Msg Size;Msg Flags;Msg Status;Modify Time;Delivery Time;Recipient Nbr;Recipients;Attachment Nbr;Attachments
0x00,0x00;"Jonas/M";"\"Jonas/M\" <+46737000000>";;"";"Den går 28 ";"";2;"\\%MDF4";"SMS";"IPM.SMStext";0;24;0;0;2011,03,12,21,15,19;2011,03,12,21,16,17;0;"";0;""
0x00,0x00;"Don Vär";"\"Don Vär\" <+46709000000>";;"";"försöke® dhdjhdhhdjehdejehţýùhbfvfghjujhuikjkłánjajnxsjajmsxnsmajmkjsnshdjnsjmwkjhdnjsjmwkjdhjjdewjjwjwjw®";"";2;"\\%MDF1";"SMS";"IPM.SMStext";0;212;1;0;2010,05,17,15,56,49;2010,05,17,15,55,46;0;"";0;""

The exception traceback is:

Traceback (most recent call last):
  File "C:\Programming\workspace\pim2smsbr\src\pim2smsbr.py", line 207, in <module>
    convert(args.source[0], args.out)
  File "C:\Programming\workspace\pim2smsbr\src\pim2smsbr.py", line 98, in convert
    row = sms_reader.next()
  File "C:\Python27\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\ue403' in position 77: character maps to <undefined>

UPDATE:

John Machin's answer below works a treat. I simply changed one line and it's all good. Change:

sms_text = csv_file.read().decode('utf-16').split(os.linesep)

To:

sms_text = csv_file.read().decode('utf-16').encode('utf-8').splitlines()

You could start off by giving us a sample of the PIM backup file that you can read and the German user can't read.

The fact that you occasionally get a UnicodeEncodeError (note Encode, not Decode) is significant. Care to change your code to display the exact error message and traceback that you get, instead of suppressing them?
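One way to surface the error instead of swallowing it is to record the full traceback for each failing row. This is only a minimal sketch: convert_rows and convert_one are hypothetical names, not functions from the script in the question, and the failing step there is the reader itself rather than a per-row callback.

```python
import traceback


def convert_rows(rows, convert_one):
    """Apply convert_one to every row and keep a traceback for each failure.

    Both names are hypothetical; they stand in for whatever loop the real
    script uses.
    """
    results = []
    failures = []
    for index, row in enumerate(rows):
        try:
            results.append(convert_one(row))
        except UnicodeError:
            # UnicodeEncodeError and UnicodeDecodeError both derive from
            # UnicodeError, so either one is recorded with its traceback.
            failures.append((index, traceback.format_exc()))
    return results, failures
```

Printing the collected tracebacks at the end shows which codec was involved (for example 'ascii' or 'charmap') and which character it rejected.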

Are you running this on Linux/OSX/Windows? If Windows, in a Command Prompt window? If so, what does the CHCP command tell you? What does it tell your German correspondent?
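CHCP reports the console code page. From inside Python, a rough companion check is to print the encodings the interpreter itself reports. The sketch below assumes nothing about the script in the question; encoding_report is a hypothetical helper name.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings this Python process reports (hypothetical helper)."""
    return {
        "default": sys.getdefaultencoding(),
        # sys.stdout may be replaced by an object without an encoding attribute.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(),
    }


if __name__ == "__main__":
    for name, value in sorted(encoding_report().items()):
        print("%s: %s" % (name, value))
```

Running it once under Eclipse/PyDev and once in a Command Prompt window may show where the two environments differ.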

Have you read what the csv docs have to say about Unicode? This is what happens:

>>> import csv
>>> r = csv.reader([u"\xA0"])
>>> r.next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>>

You have a much better chance of getting this to work if you take the following steps:

1. read the raw bytes in the file
2. decode the byte string to Unicode using UTF-16
3. encode the Unicode string in UTF-8
4. split the UTF-8 string into a list of lines (use str.splitlines())
5. make a csv reader out of that list
6. iterate over the rows, decoding each cell from UTF-8 to Unicode.
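The six steps above can be sketched roughly as follows. read_pim_rows is a hypothetical name, and the delimiter, quotechar and escapechar values are taken from the question. The UTF-8 round trip is only needed on Python 2, where the csv module works on byte strings; the Python 3 branch is an addition so the sketch also runs on newer interpreters.

```python
import csv
import sys

PY2 = sys.version_info[0] == 2


def read_pim_rows(raw_bytes):
    """Turn raw UTF-16 bytes from a PIM Backup export into rows of text cells."""
    # Steps 1-2: start from raw bytes and decode them as UTF-16.
    text = raw_bytes.decode("utf-16")
    if PY2:
        # Steps 3-4: Python 2's csv module wants byte strings, so hand it UTF-8.
        lines = text.encode("utf-8").splitlines()
    else:
        # Assumption: on Python 3 the csv module accepts text directly.
        lines = text.splitlines()
    # Step 5: build the reader from the list of lines.
    reader = csv.reader(lines, delimiter=";", quotechar='"', escapechar="\\")
    rows = []
    for row in reader:
        if PY2:
            # Step 6: decode every cell back from UTF-8 to Unicode.
            row = [cell.decode("utf-8") for cell in row]
        rows.append(row)
    return rows
```

Called with the bytes read from a file opened in binary mode, this returns a list of rows whose cells are Unicode text, so no implicit encoding is left for the csv module to choose.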

Update: I see nothing in your edits to your question to make me change my previous advice. You have the choice of omitting step 6 above (this will work but is evil) or including step 6 and rewriting your output phase to use [c]ElementTree or lxml to do the UTF-8 encoding, escaping, etc. By the way, you are writing XML files that say they are encoded in UTF-8. I can't reproduce this because I don't have Eclipse, but I suspect that the XML files that you write "OK" when running under Eclipse are actually encoded in cp1252. Have you tried them with an XML validator?
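As a rough illustration of the ElementTree route, the sketch below lets the library handle both the UTF-8 encoding and the escaping. The element and attribute names (smses, sms, address, body) and the function name build_sms_xml are assumptions made for this example, not a confirmed description of the SMS Backup & Restore format.

```python
import xml.etree.ElementTree as ET


def build_sms_xml(messages):
    """Serialize (address, body) pairs of Unicode text to UTF-8 encoded XML.

    The tag and attribute names are hypothetical placeholders.
    """
    root = ET.Element("smses", count=str(len(messages)))
    for address, body in messages:
        # ElementTree escapes &, <, > and quotes in attribute values itself.
        ET.SubElement(root, "sms", address=address, body=body)
    # Passing encoding="utf-8" yields bytes that start with an XML declaration.
    return ET.tostring(root, encoding="utf-8")
```

Parsing the result back with ET.fromstring is a cheap well-formedness check, although it is not a substitute for a real validator.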

Your issue with the U+E403 character is just one part of a larger problem: your script will "work" only with characters that can be represented in whatever encoding the csv module picks when confronted with unicode input. That character is in one of the PUA (Private Use Area) blocks set aside for vendor-specific stuff (e.g. the Apple symbol) or application stuff. It can't be encoded in cp1252 (the codec shown in your traceback), and can't be rendered properly (because it's not in a published font). Googling "emoji E403" and following the resulting leads indicates that it may correspond to U+1F614 PENSIVE FACE, new in Unicode 6.0.
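A small check makes the difference between the environments concrete. The 'ascii' and 'cp1252' codec names come from the two tracebacks in this thread; can_encode is a hypothetical helper written for this sketch.

```python
import unicodedata


def can_encode(char, encoding):
    """Return True if char can be represented in the given encoding."""
    try:
        char.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for char in (u"\xfc", u"\ue403"):
        print("U+%04X category=%s ascii=%s cp1252=%s utf-8=%s" % (
            ord(char),
            unicodedata.category(char),  # 'Co' marks a private-use code point
            can_encode(char, "ascii"),
            can_encode(char, "cp1252"),
            can_encode(char, "utf-8"),
        ))
```

The u-umlaut fits in cp1252 but not in ASCII, while U+E403 fits in neither and only survives in an encoding such as UTF-8. That matches a script that mostly works when the implicit codec is cp1252 and mostly fails when it is ASCII.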