开发者

How to discover the Unicode code point and UTF-8 encoded value of an unknown character?

开发者 https://www.devze.com 2023-03-30 15:05 Source: the web

I'm doing text mining on content that comes from the web. There are a lot of characters that I want to convert in order to get better classification (e.g., &nbsp; to white space).

The problem is that sometimes I get unknown characters, and I want to discover their Unicode code points and UTF-8 representations.

I want to know if there is an online tool or a program that can tell me this.

At the moment, I'm trying to identify a line break that I found but that doesn't match \n or \s in a regex. In the past, I had trouble with &nbsp;.

I don't know what it is, and I want to know if there is a way to find out.

The character appears here, after personagens, but it can only be seen by viewing the original source without formatting.

"personagens "


Based on the comments, it appears that you needed to know the Unicode codepoints of certain characters, or their UTF-8 representations.

You can use the character inspector application, written by McDowell, who's one of StackOverflow's users, to determine the Unicode codepoint as well as the UTF-8 representations. You'll need to set the charset as UTF-8 in the application, once you've pasted the contents of the message.

You can also use the String class of the Java API to get the raw code points of characters in a String, via the codePointAt method. Note that if you convert the String to a char array, the array will contain UTF-16 code units. This is fine if you intend to invoke the Character.codePointAt method, but you must take care to deal with low surrogates: a character outside the Basic Multilingual Plane occupies two chars, and the second one must be skipped rather than read as a character of its own.
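As a rough illustration of the codePointAt approach, here is a minimal sketch that prints each code point of a string together with its UTF-8 bytes. The class name `CodePointInspector`, the method `describe`, and the sample input are made up for this example; only the `String`, `Character`, and `StandardCharsets` calls come from the Java standard library.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
public class CodePointInspector {

    /** Returns one line per code point, formatted like "U+00A0 -> C2 A0". */
    static List<String> describe(String text) {
        List<String> lines = new ArrayList<>();
        // Advance by code point, not by char, so a surrogate pair is read once.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            StringBuilder hex = new StringBuilder();
            for (byte b : ch.getBytes(StandardCharsets.UTF_8)) {
                if (hex.length() > 0) {
                    hex.append(' ');
                }
                hex.append(String.format("%02X", b & 0xFF));
            }
            lines.add(String.format("U+%04X -> %s", cp, hex));
            i += Character.charCount(cp); // 2 for characters outside the BMP
        }
        return lines;
    }

    public static void main(String[] args) {
        // Sample input only: 'a', a no-break space (U+00A0), a line separator (U+2028).
        for (String line : describe("a\u00A0\u2028")) {
            System.out.println(line);
        }
    }
}
```

Pasting the suspicious text into the string passed to `describe` (or reading it from a file decoded as UTF-8) should show which code point is hiding after "personagens". An unpaired surrogate in the input will not encode to meaningful UTF-8 bytes, so treat such a line as a sign that the text was damaged earlier.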


Run the uniquote program:

$ echo 'bád⁠⁠ƨtüﬀ' | uniquote -x
b\x{E1}d\x{2060}\x{2060}\x{1A8}t\x{FC}\x{FB00}

$ echo 'bád⁠⁠ƨtüﬀ' | uniquote -v
b\N{LATIN SMALL LETTER A WITH ACUTE}d\N{WORD JOINER}\N{WORD JOINER}\N{LATIN SMALL LETTER TONE TWO}t\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LIGATURE FF}

$ echo 'bád⁠⁠ƨtüﬀ' | uniquote --html
bád⁠⁠ƨtüff

You don’t need to use echo; you can just cut and paste, then hit ^D when you’re done:

$ uniquote -v -
'bád⁠⁠ƨtüﬀ'
^D
'b\N{LATIN SMALL LETTER A WITH ACUTE}d\N{WORD JOINER}\N{WORD JOINER}\N{LATIN SMALL LETTER TONE TWO}t\N{LATIN SMALL LETTER U WITH DIAERESIS}\N{LATIN SMALL LIGATURE FF}'
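If uniquote is not at hand, output in a similar style can be approximated from Java. The sketch below is not how uniquote itself works; it only imitates the `\x{...}` and `\N{...}` notation shown above. The class name `EscapeNonAscii` and the method `escape` are invented for this example, while `Character.getName` and `String.codePoints` are standard Java methods.

```java
// Hypothetical helper that imitates the notation of the transcripts above.
public class EscapeNonAscii {

    /** Rewrites each non-ASCII code point as \x{HEX}, or as \N{NAME} when verbose is true. */
    static String escape(String text, boolean verbose) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                out.appendCodePoint(cp); // plain ASCII passes through unchanged
                return;
            }
            // getName returns null for unassigned code points.
            String name = verbose ? Character.getName(cp) : null;
            if (name != null) {
                out.append("\\N{").append(name).append('}');
            } else {
                // Hex form when names were not requested or none is known.
                out.append(String.format("\\x{%X}", cp));
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input only: b, a-acute, d, word joiner, tone two.
        String sample = "b\u00E1d\u2060\u01A8";
        System.out.println(escape(sample, false));
        System.out.println(escape(sample, true));
    }
}
```

The names reported depend on the Unicode version bundled with the JDK in use, so very recent characters may fall back to the hex form.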