Java - removing strange characters from a String_问答_开发者

How do I remove strange and unwanted Unicode characters (such as a black diamond with question mark) from a String?

Updated:

Please tell me the Unicode character string开发者_C百科 or regex that correspond to "a black diamond with question mark in it".

A black diamond with a question mark is not a unicode character -- it's a placeholder for a character that your font cannot display. If there is a glyph that exists in the string that is not in the font you're using to display that string, you will see the placeholder. This is defined as U+FFFD: �. Its appearance varies depending on the font you're using.

You can use java.text.normalizer to remove Unicode characters that are not in the "normal" ASCII character set.

You can use a String.replaceAll("[my-list-of-strange-and-unwanted-chars]","")

There is no Character.isStrangeAndUnWanted(), you have to define what you want.

If you want to remove control characters you can do

String str = "\u0000\u001f hi \n";
str = str.replaceAll("[\u0000-\u001f]", "");

prints hi (keeps the space).

EDIT If you want to know the unicode of any 16-bit character you can do

int num = string.charAt(n);
System.out.println(num);

To delete non-Latin symbols from the string I use the following code:

String s = "小米体验版 latin string 01234567890";
s = s.replaceAll("[^\\x00-\\x7F]", "");

The output string will be: " latin string 01234567890"

Justin Thomas's was close, but this is probably closer to what you're looking for:

String nonStrange = strangeString.replaceAll("\\p{Cntrl}", "");

The selector \p{Cntrl} selects "A control character: [\x00-\x1F\x7F]."

Use String.replaceAll( ):

String clean = "♠clean".replaceAll('♠', '');

I did the other way. I replace all letters that are not defined ((^)):

str.replaceAll("[^a-zA-Z0-9:;.?! ]","")

so for words like : "小米体验版 latin string 01234567890" we will get: "latin string 01234567890"

Put the characters that you want to get rid of in an array list, then iterate through the array with a replaceAll method:

String str = "Some text with unicode !@#$";
ArrayList<String> badChar = new ArrayList<String>();
badChar= ['@', '~','!']; //modify this to contain the unicodes

for (String s : badChar) {
   String resultStr = str.replaceAll(s, str);
}

you will end up with a cleaned string "resultStr" haven't tested this but along the lines.

same happened with me when i was converting clob to string using getAsciiStream.

efficiently solved it using

public String getstringfromclob(Clob cl)
{
    StringWriter write = new StringWriter();
    try{
        Reader read  = cl.getCharacterStream();     
    int c = -1;
    while ((c = read.read()) != -1)
    {
        write.write(c);
    }
    write.flush();
    }catch(Exception ec)
    {
        ec.printStackTrace();
    }
    return write.toString();

}

filter English ,Chinese,number and punctuation

str = str.replaceAll("[^!-~\\u20000-\\uFE1F\\uFF00-\\uFFEF]", "");

Most probably the text that you got was encoded in something other than UTF-8. What you could do is to not allow text with other encodings (for example Latin-1) to be uploaded:

try {

  CharsetDecoder charsetDecoder = StandardCharsets.UTF_8.newDecoder();
  charsetDecoder.onMalformedInput(CodingErrorAction.REPORT);

  return IOUtils.toString(new InputStreamReader(new FileInputStream(filePath), charsetDecoder));
}
catch (MalformedInputException e) {
  // throw an exception saying the file was not saved with UTF-8 encoding.
}

You can't because strings are immutable.

It is possible, though, to make a new string that has the unwanted characters removed. Look up String#replaceAll().