I'm doing text mining on content that comes from the web. There are a lot of characters that I want to convert to get better classification (e.g., converting them to white space).
The problem is that I sometimes get unknown characters, and I want to find out their Unicode code point and UTF-8 representation.
Is there an online tool or a program that can tell me this?
At the moment, I'm trying to identify a line break that I found, but it doesn't match \n or \s in a regex. In the past, I had trouble with the  .
I don't know what it is, and I want to know if there is a way to find out.
The character appears here, after personagens, but it is only visible when viewing the original source without formatting.
"personagens
"
Based on the comments, it appears that you needed to know the Unicode codepoints of certain characters, or their UTF-8 representations.
You can use the character inspector application, written by McDowell, a StackOverflow user, to determine the Unicode code point as well as the UTF-8 representation. After pasting the contents of the message, you'll need to set the charset to UTF-8 in the application.
You can also use the String class of the Java API to get the raw code points of characters in a String, via the codePointAt method. Note that if you convert the String to a char array, the array will contain UTF-16 code units; this is fine if you intend to invoke the Character.codePointAt method, but you must take care to handle low surrogates (the second half of a surrogate pair) correctly.
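The codePointAt approach described above could be sketched roughly as follows. This is only a minimal illustration: the class name CodePointDump, the describe helper, and the output layout are made up for this example, and the no-break space in the default input is a stand-in, since the actual character in the question is not known. Stepping by Character.charCount keeps the loop from landing on a low surrogate.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of any library.
final class CodePointDump {

    // Returns one description line per code point (not per UTF-16 char).
    static List<String> describe(String s) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);

            // Encode just this code point to see its UTF-8 bytes.
            byte[] utf8 = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            StringBuilder hex = new StringBuilder();
            for (byte b : utf8) {
                if (hex.length() > 0) hex.append(' ');
                hex.append(String.format("%02X", b & 0xFF));
            }

            // getName returns null for unassigned code points.
            String name = Character.getName(cp);
            lines.add(String.format("U+%04X  UTF-8: %s  %s", cp, hex, name));

            // Advance by 1 or 2 chars so a surrogate pair is consumed whole.
            i += Character.charCount(cp);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Assumed sample input: a word followed by a no-break space.
        String input = args.length > 0 ? args[0] : "personagens\u00A0";
        for (String line : describe(input)) {
            System.out.println(line);
        }
    }
}
```

Reading the text from a file with an explicit UTF-8 charset, rather than pasting it on the command line, avoids the shell or console re-encoding the character before Java sees it.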
Run the uniquote program:
$ echo 'bádƨtüff' | uniquote -x
b\x{E1}d\x{2060}\x{2060}\x{1A8}t\x{FC}\x{FB00}
$ echo 'bádƨtüff' | uniquote -v
b\N{LATIN SMALL LETTER A WITH ACUTE}d\N{WORD JOINER}\N{WORD JOINER}\N{LATIN SMALL LETTER TONE TWO}t\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LIGATURE FF}
$ echo 'bádƨtüff' | uniquote --html
bád⁠⁠ƨtüff
You don’t need to use echo; you can just cut and paste, then hit ^D when you’re done:
$ uniquote -v -
'bádƨtüff'
^D
'b\N{LATIN SMALL LETTER A WITH ACUTE}d\N{WORD JOINER}\N{WORD JOINER}\N{LATIN SMALL LETTER TONE TWO}t\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LIGATURE FF}'
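If uniquote is not at hand, escaping in the style of the -x output shown above can be approximated in a few lines of Java. This is a rough sketch and not uniquote's actual implementation: the class name EscapeNonAscii and the rule used here (printable ASCII, tab, and newline pass through; everything else is escaped) are assumptions chosen to reproduce the sample output.

```java
// Hypothetical helper class, not part of uniquote or any library.
final class EscapeNonAscii {

    // Replaces every code point outside printable ASCII with \x{HEX}.
    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            boolean plain = (cp >= 0x20 && cp < 0x7F) || cp == '\n' || cp == '\t';
            if (plain) {
                out.appendCodePoint(cp);
            } else {
                // Uppercase hex without padding, e.g. \x{E1} or \x{2060}.
                out.append(String.format("\\x{%X}", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape(String.join(" ", args)));
    }
}
```

Because invisible characters such as WORD JOINER come out as explicit escapes, this makes them easy to spot in otherwise ordinary-looking text.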