Should source code be saved in UTF-8 format_问答_开发者

How important is it to save your source code in UTF-8 format?

Eclipse on Windows uses CP12开发者_StackOverflow社区52 character encoding by default. The CP1251 format means non UTF-8 characters can be saved and I have seen this happen if you copy and paste from a Word document for a comment.

The reason I ask is because out of habit I set-up Maven encoding to be in UTF-8 format and recently it has caught a few non mappable errors.

(update) Please add any reasons for doing so and why, are there some common gotchas that should be known?

(update) What is your goal? To find the best practice so when ask why should we use UTF-8 I have a good answer, right now I don't.

What is your goal? Balance your needs against the pros and cons of this choice.

UTF-8 Pros

allows use of all character literals without \uHHHH escaping

UTF-8 Cons

using non-ASCII character literals without \uHHHH increases risk of character corruption
- font and keyboard issues can arise
- need to document and enforce use of UTF-8 in all tools (editors, compilers build scripts, diff tools)
beware the byte order mark

ASCII Pros

character/byte mappings are shared by a wide range of encodings
- makes source files very portable
- often obviates the need for specifying encoding meta-data (since the files would be identical if they were re-encoded as UTF-8, Windows-1252, ISO 8859-1 and most things short of UTF-16 and/or EBCDIC)

ASCII Cons

limited character set
this isn't the 1960s

Note: ASCII is 7-bit, not "extended" and not to be confused with Windows-1252, ISO 8859-1, or anything else.

Important is at least that you need to be consistent with the encoding used to avoid herrings. Thus not, X here, Y there and Z elsewhere. Save source code in encoding X. Set code input to encoding X. Set code output to encoding X. Set characterbased FTP transfer to encoding X. Etcetera.

Nowadays UTF-8 is a good choice as it covers every character the human world is aware of and is pretty everywhere supported. So, yes, I would set workspace encoding to it as well. I also use it so.

Eclipse's default setting of using the platform default encoding is a poor decision IMHO. I found it necessary to change the default to UTF-8 shortly after installing it because some of my existing source files used it (probably from snippets copied/pasted from web pages.)

The Java Language and API specs require UTF-8 support so you're definitely okay as far as the standard tools go, and it's a long time since I've seen a decent editor that did not support UTF-8.

Even in projects that use JNI, your C sources will normally be in US-ASCII which is a subset of UTF-8 so having both open in the same IDE will not be a problem.

Yes, unless your compiler/interpreter is not able to work with UTF-8 files, it is definitely the way to go.

I don't think there's really a straight yes or no answer to this question. I would say that the following guidelines should be used to pick an encoding format, in order of priority listed (highest to lowest):

1) Pick an encoding your tool chain supports. This is a lot easier than it used to be. Even in recent memory a lot of compilers and languages basically only supported ASCII, which more or less forced developers into coding in Western European languages. These days, many of the newer languages support other encodings, and almost all decent editors and IDEs support a tremendously long list of encodings. Still... there are just enough holdouts that you need to double check before you settle on an encoding.

2) Pick an encoding that supports as many of the alphabets you wish to use as possible. I place this as a secondary priority because frankly, if your tools don't support it it doesn't really matter whether you like the encoding better or not.

UTF-8 is an excellent choice in many circumstances of today's world. It's an ugly, inelegant format, but it solves a whole host of problems (namely dealing with legacy code) that break other encodings, and it seems to becoming more and more the de facto standard of character encodings. It supports every major alphabet, darn near every editor on the planet supports it now, and a whole host of languages/compilers support it, too. But as I mentioned above, there are just enough legacy holdouts that you need to double check your tool chain from end to end before you settle on it definitively.