Does the Java VM become slower depending on its encoding?

Developer https://www.devze.com 2023-02-26 19:20 Source: Internet
Suppose a Spanish team-mate writes a class with a name like TipoNotificación. Note the special characters like ú, ó, etc.

Beyond the coding project normalization, what kind of troubles could I face?


Beyond the coding project normalization

That should be reason enough to exclude non-ASCII characters from identifiers:

  1. some characters are visually indistinguishable (for example U+0041, Latin capital A, and U+0391, Greek capital Alpha), which in extreme cases can lead to confusion
  2. not everyone has a keyboard that makes typing [a]cute characters easy, which can be frustrating for developers.
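
As a rough illustration of the first point, the sketch below compares the two look-alike letters mentioned above. The class and method names (HomoglyphDemo, sameText) are made up for this example, and the characters are written as Unicode escapes so the source compiles the same way whatever the file encoding is.

```java
public class HomoglyphDemo {
    // Compares two strings by their actual code points, not by how they look.
    static boolean sameText(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        String latinA = "\u0041";     // LATIN CAPITAL LETTER A
        String greekAlpha = "\u0391"; // GREEK CAPITAL LETTER ALPHA

        // Both render as "A" in most fonts, yet they are different characters.
        System.out.println(sameText(latinA, greekAlpha)); // false

        // Both are legal at the start of a Java identifier, so a class named
        // with one cannot be referenced by typing the other.
        System.out.println(Character.isJavaIdentifierStart(latinA.codePointAt(0)));     // true
        System.out.println(Character.isJavaIdentifierStart(greekAlpha.codePointAt(0))); // true

        System.out.println(Character.getName(0x0041)); // LATIN CAPITAL LETTER A
        System.out.println(Character.getName(0x0391)); // GREEK CAPITAL LETTER ALPHA
    }
}
```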

As for your original question, I don't think there's any significant overhead. As already stated, strings are internally stored in UTF-16. However, filenames (including class filenames) in JAR files are encoded in UTF-8, which means the JVM reads one extra byte for each accented character at load time. Since Spanish words rarely have more than one diacritic, you can expect an average of one or two additional bytes per class. There's no way you could notice it, even in the most limited hardware environments.
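
To put a number on that, here is a minimal sketch that measures the UTF-8 size of the class name from the question against an unaccented spelling. ClassNameSizeDemo and utf8Length are hypothetical names chosen for the example. It uses standard UTF-8, which for a character like ó should give the same two bytes as the encoding used inside class files.

```java
import java.nio.charset.StandardCharsets;

public class ClassNameSizeDemo {
    // Number of bytes the given name occupies when encoded as UTF-8.
    static int utf8Length(String name) {
        return name.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String accented = "TipoNotificaci\u00f3n"; // TipoNotificacion with an accented o
        String plain = "TipoNotificacion";

        System.out.println(accented.length());    // 16 chars
        System.out.println(utf8Length(plain));    // 16 bytes
        System.out.println(utf8Length(accented)); // 17 bytes: one extra for the accent
    }
}
```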


Names of classes are only used at link time (and reflection), so your application should be unaffected once it is up and running. I can't imagine the overhead of decoding multi-byte characters will be significant.

OTOH, you could end up with the usual problems with file system names, text editor character encoding and perhaps even jar/zip file names.


The only thing that would (should) be affected is the time it takes to load and process text files. Class files (binaries) should not be affected. Make sure your Java IDE and build system are set up properly. If you are using Maven, you will be prompted to set your character set encoding in numerous places.

The JVM does store strings as UTF-16 (originally UCS-2). That means that every char is a 16-bit code unit: most characters, accented Spanish letters included, take two bytes, and characters outside the Basic Multilingual Plane take two code units. This can sometimes be an unpleasant surprise for folks coming from a C background, where each character is usually an ASCII byte (with the high bit undefined). You can spend weeks learning and torturing yourself over encodings.
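
A short sketch of what that looks like from Java code, assuming only the standard String and Character APIs. Utf16Demo and its two helper methods are names invented for this example. An accented letter fits in a single 16-bit char, while a character beyond U+FFFF needs a surrogate pair.

```java
public class Utf16Demo {
    // Number of 16-bit UTF-16 code units (what String.length() reports).
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of actual Unicode characters (code points).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String accented = "\u00f3";                            // o with acute accent
        String emoji = new String(Character.toChars(0x1F600)); // outside the BMP

        System.out.println(codeUnits(accented));  // 1
        System.out.println(codePoints(accented)); // 1
        System.out.println(codeUnits(emoji));     // 2 (a surrogate pair)
        System.out.println(codePoints(emoji));    // 1
    }
}
```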

Probably the single piece of advice I can give that will be helpful is to set EVERYTHING to UTF-8. Just standardize on that, everywhere. In your IDEs, text editors, builds, JSP pages, and especially in your database. Write unit tests and integration tests to ensure that everything is set to UTF-8. You really don't want to have to deal with data migration/cleanup, trying to figure out what random encoding led to a particular string of weird characters.
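
To show why mixed encodings hurt, here is a hedged sketch of what happens when text written as UTF-8 is read back with a different charset. EncodingMismatchDemo and roundTrip are hypothetical names, and ISO-8859-1 simply stands in for "whatever the other system defaulted to". Passing the charset explicitly, rather than relying on the platform default, is one way to keep the two sides in agreement.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // the way a misconfigured editor, build or database connection would.
    static String roundTrip(String text, Charset writeAs, Charset readAs) {
        byte[] bytes = text.getBytes(writeAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        String original = "Notificaci\u00f3n";

        // Same charset on both sides: the text survives intact.
        String ok = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(ok.equals(original)); // true

        // Written as UTF-8 but read as ISO-8859-1: the two bytes of the
        // accented letter become two separate wrong characters.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(original)); // false
        System.out.println(garbled.length());         // 13 instead of 12
    }
}
```

A unit test along these lines, comparing strings rather than printed output, is one way to check that a given layer really is using UTF-8.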

Here's a slide deck on I18N I wrote a while ago, hopefully this will help.

http://www.slideshare.net/williverson/software-internationalization-crash-course

Oh, and you should assume that any file names that will ever go over a network (e.g. file share, email) will be screwed up and rendered as ASCII or the local OS encoding. For example, on Macs that will be MacRoman and on US English Windows systems CP1252. So, if you bundle up your classes in a JAR it's probably ok, but unexploded classes (or source files!) will have a problem. That's not the JVM's doing, but an OS-level thing.


No, it shouldn't cause any problems at runtime. Java stores all strings internally as UTF-16, anyway. The only problems you might run into are with managing the source files.


Java encodes Strings using UTF-16, which covers accented characters without needing any extra memory. Therefore, the answer to your question is no.
