Python, best approach for unicode support?

Developer https://www.devze.com 2023-03-09 00:37 Source: Internet

I have a Python application that gets multilingual information from websites and presents it in a small GUI window (wxPython based).

I (currently) don't use any specific unicode statements in my source files.

Now, when I run my Python application from within Eclipse, French characters (like ë) are displayed nicely. When I run the py2exe-packaged version, the characters go wonky. I don't really understand why, as building with py2exe doesn't produce any unicode- or encoding-related errors.

However, to fix this issue, and following this article, I wrapped my strings in a unicode(my_string, "utf-8") call just before outputting them to the screen. This solves it.

Questions:

Is wrapping strings in a unicode() call just before displaying them the right way to do it?
Why does it work without the unicode conversion from within Eclipse, but not from a Windows packaged .exe version?

I have tried many times to wrap my head around unicode, but it seems I am not unicode compatible :-|

The best approach is to make sure your strings are unicode as early as possible. If the library you are scraping websites with is not providing you with unicode, then it is not doing what it should (imho). In that case you have to decode the strings to unicode yourself, using the same encoding as the web pages you are scraping.
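As a rough sketch of what "decode early" could look like, assuming the scraping library hands back raw bytes and that you know each page's encoding (from the HTTP headers, for example). The helper name to_text and the sample payloads are made up for illustration; raw.decode(encoding) is the same operation as Python 2's unicode(raw, encoding).

```python
def to_text(raw, encoding):
    """Decode bytes to a text (unicode) object; pass text through untouched."""
    if isinstance(raw, bytes):
        return raw.decode(encoding)
    return raw


# Hypothetical payloads: the same word as served by two differently encoded pages.
utf8_page = b"No\xc3\xabl"    # "Noël" encoded as UTF-8
latin1_page = b"No\xebl"      # "Noël" encoded as ISO-8859-1

# Decode right where the data enters the program, each with its own encoding.
a = to_text(utf8_page, "utf-8")
b = to_text(latin1_page, "iso-8859-1")

# Once decoded, both values are the same text and the rest of the
# application never has to think about encodings again.
same = (a == b)
```

Everything downstream of that point (including the wxPython widgets) then only ever sees unicode objects.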

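Your approach is basically the opposite: decoding as late as possible. That it has worked so far is pure luck, because you have not encountered any non-UTF-8 strings yet. An ISO-8859-1 string containing non-ASCII characters will most likely break your app, because such bytes are usually not valid UTF-8 and unicode(my_string, "utf-8") then raises a UnicodeDecodeError.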
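To make the failure concrete, here is a small sketch using a hand-built ISO-8859-1 byte string (the variable names are only for illustration). Decoding it with the wrong guess fails, while decoding it with the encoding it was actually written in works:

```python
# The character ë as a page encoded in ISO-8859-1 would deliver it: one byte, 0xEB.
latin1_bytes = u"\xeb".encode("iso-8859-1")

try:
    latin1_bytes.decode("utf-8")    # wrong guess: 0xEB alone is not valid UTF-8
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding with the page's real encoding recovers the character.
recovered = latin1_bytes.decode("iso-8859-1")
```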

Why does it work without the unicode conversion from within Eclipse, but not from a Windows packaged .exe version?

I assume that you're using PyDev in Eclipse?

It happened to me recently too. PyDev changes the value returned by sys.getdefaultencoding() to "utf-8", which means that implicit conversions between byte strings and unicode use UTF-8 by default. But once I launched the program from the console, it was back to Python 2's normal default, which is "ascii".
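If you want to check this yourself, a quick diagnostic along these lines could be run once from inside Eclipse and once from the packaged .exe to compare the two environments. This is only a sketch; the variable names are mine, and sys.stdout.encoding may be None when there is no console attached.

```python
import sys

# Encoding Python uses for implicit byte-string/unicode conversions.
default_encoding = sys.getdefaultencoding()

# Encoding of the output stream, if the stream reports one.
stdout_encoding = getattr(sys.stdout, "encoding", None)

sys.stdout.write("default encoding: %s\n" % default_encoding)
sys.stdout.write("stdout encoding: %s\n" % stdout_encoding)
```

If the two runs print different values, your code is relying on an implicit conversion somewhere.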

The good practice when declaring strings is to put a u in front of them:

u"the string"

so that the string is a unicode object rather than a byte string. This becomes the default in Python 3+.
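A tiny sketch of the difference between a unicode object and its encoded bytes (names chosen only for illustration). Note that the unicode object has no encoding of its own; UTF-8 only comes into play once you encode it:

```python
text = u"\xeb"                # unicode object holding one character: ë
raw = text.encode("utf-8")    # byte string: how that character looks in UTF-8

text_length = len(text)       # counts characters
raw_length = len(raw)         # counts bytes
```

Here text_length is 1 and raw_length is 2, because ë takes two bytes in UTF-8.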

I may be wrong, but I think it was working in Eclipse because Eclipse works in UTF-8 by default, whereas py2exe produces a Windows executable, which uses Latin-1.

By using unicode(a_string, "UTF-8"), you explicitly create a Python unicode object by decoding a_string as UTF-8. The interpreter then no longer has to guess the encoding when it uses the object.

A unicode object can be used transparently in place of a string in a lot of methods, functions and classes, including print. Be warned, though, that sometimes you must pass a byte string as a function argument.
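For those cases, one option is to encode explicitly at the very last moment. A minimal sketch, where io.BytesIO simply stands in for any API that only accepts byte strings, and UTF-8 is an assumed target encoding (use whatever the receiving side expects):

```python
import io

label = u"Fran\xe7ais"                # unicode object used everywhere inside the app

sink = io.BytesIO()                   # stand-in for a bytes-only API
sink.write(label.encode("utf-8"))     # encode only at the boundary

written = sink.getvalue()
```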

Do you put something like # -*- coding: utf-8 -*- at the top of your file? It tells the interpreter "any string literal in this file IS in UTF-8".

It may let you avoid explicit conversion of your strings to unicode objects.
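For what it's worth, here is roughly what that looks like, assuming the file really is saved as UTF-8 by your editor. Keep in mind the declaration only affects literals typed in the source file; it does nothing for strings that arrive from websites, which still need to be decoded.

```python
# -*- coding: utf-8 -*-
# The declaration above has to sit on the first or second line of the file.

greeting = u"Joyeux Noël"      # non-ASCII literal typed directly in the source
escaped = u"Joyeux No\xebl"    # the same text written with an escape sequence

# With a correct declaration, both literals produce the same unicode object.
matches = (greeting == escaped)
```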