开发者

Removing diacritics and platform issue

开发者 https://www.devze.com 2023-03-04 06:22 Source: Internet
I have this method to remove diacritics from a string in Java:

String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD);
Pattern pattern = Pattern.compile("\\p{InCombiningDiacriticalMarks}+");
return pattern.matcher(nfdNormalizedString).replaceAll("");

I have a few simple test cases for this. They pass when I run them from inside my IDE, but fail when I run them from Maven. I invoke Maven from the command line, and my environment encoding is UTF-8. I am running Java 6 with the latest patch that Apple provided.

I don't know what encoding the IDE uses, but it runs the same Java. Any thoughts on what might cause this problem?


I believe it's caused by improper handling of input encoding.

If the input strings are specified in the source, you need to make sure that the encoding of the source files matches the encoding in the compiler configuration. Note that Maven requires a separate configuration of the compiler encoding, as a property named project.build.sourceEncoding in pom.xml:

<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    ...
</properties>

As a quick check, you can also replace the characters in string literals with their Unicode escapes (\uxxxx). If the problem is caused by source encoding, it should disappear.
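A minimal sketch of that quick check, assuming the method from the question is wrapped in a helper named removeDiacritics (the class name, method name, and sample string are made up for illustration). Because the literal contains only ASCII characters, the result should not depend on the encoding the compiler assumes for the source file:

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

public class DiacriticsCheck {
    // Matches the combining marks that NFD decomposition splits off from base letters.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Same logic as the method in the question; the name is hypothetical.
    public static String removeDiacritics(String str) {
        String nfd = Normalizer.normalize(str, Normalizer.Form.NFD);
        return MARKS.matcher(nfd).replaceAll("");
    }

    public static void main(String[] args) {
        // Accented letters are written as Unicode escapes, so the source file is pure ASCII.
        String input = "caf\u00e9 \u00e0 la cr\u00e8me";
        System.out.println(removeDiacritics(input)); // prints "cafe a la creme"
    }
}
```

If a test written this way passes under Maven while the same test with literal accented characters fails, source encoding is the likely culprit.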

If you read the input data from a file, make sure that you correctly specify the encoding of the file in your code, and don't use methods that rely on the system default encoding.
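As an illustration of specifying the encoding explicitly, here is a small sketch that stays Java 6 compatible. The class and method names are hypothetical, and a byte array stands in for the file content; for a real file you would pass a FileInputStream instead. The point is that the charset is handed to InputStreamReader rather than left to the platform default (which is what FileReader uses on Java 6):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ExplicitEncodingRead {
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    // Reads the whole stream as text using the given charset, never the platform default.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // The bytes C3 A9 are the UTF-8 encoding of e-acute; they stand in for file content.
        byte[] raw = {(byte) 0xC3, (byte) 0xA9};

        String right = readAll(new ByteArrayInputStream(raw), UTF_8);
        System.out.println(right.length()); // prints 1: one character, decoded correctly

        // Decoding the same bytes with the wrong charset yields two unrelated characters.
        String wrong = readAll(new ByteArrayInputStream(raw), Charset.forName("ISO-8859-1"));
        System.out.println(wrong.length()); // prints 2
    }
}
```

A string decoded with the wrong charset no longer contains the accented letter at all, so normalization has nothing to strip and the test fails.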

See also:

  • Unicode - How to get the characters right?