Issues with text parsing, Character looks like a longer 'hyphen' and has 3 ASCII values

Posted on https://www.devze.com, 2023-02-09 07:43. Source: the web.
Here is the devilish character ‐; inspecting it I got 3 ASCII values:

ASCII code 226 128 147

Now I want to somehow use this character in my regular expression.


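None of those is an ASCII value, because the ASCII range is 0 through 127 and nothing higher. They are the bytes of a UTF-8 encoding. Code point U+2010 HYPHEN is written in UTF-8 as the three bytes 226 128 144, which differ from the values you list only in the last byte (226 128 147 is U+2013 EN DASH, the one that looks like a longer hyphen). The encoding of U+2010 is revealed by: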

$ perl -CS -e 'print "\x{2010}"' | perl -C0 -ne 'printf "%vd\n",$_'
226.128.144
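
The same check can be run in the other direction, from bytes back to a code point. The following is a minimal Python sketch, not part of the original answer; the helper name `identify` is made up for illustration. It decodes both byte triples that appear in this thread and looks up each character's name with the standard `unicodedata` module.

```python
import unicodedata

def identify(byte_values):
    """Decode UTF-8 byte values and name each resulting code point.

    `identify` is a hypothetical helper, not a library function.
    """
    text = bytes(byte_values).decode("utf-8")
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

# Bytes printed by the perl pipeline above.
hyphen = identify([226, 128, 144])
# Bytes quoted in the question.
en_dash = identify([226, 128, 147])

print(hyphen)   # [('U+2010', 'HYPHEN')]
print(en_dash)  # [('U+2013', 'EN DASH')]
```

Whichever of the two the input actually contains, it is a single character that happens to occupy three bytes in UTF-8.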

You can get the name and character properties of that code point using the uniprops script:

$ uniprops U+2010
U+2010 ‹‐› \N{ HYPHEN }:
    \pP \p{Pd}
    All Any Assigned InGeneralPunctuation Common Zyyy Dash Dash_Punctuation Pd P General_Punctuation Gr_Base Grapheme_Base Graph GrBase Hyphen Punct Pat_Syn Pattern_Syntax PatSyn Print Punctuation
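
If the uniprops script is not at hand, part of this information is available elsewhere. As a rough Python sketch (an illustration, not the answer's own method), the standard `unicodedata` module reports the name and general category of a code point. It does not expose the binary Dash property, and the Pd category is narrower than Dash: MINUS SIGN, for example, has the Dash property but is categorized as Sm.

```python
import unicodedata

ch = "\u2010"

info = {
    "name": unicodedata.name(ch),          # official character name
    "category": unicodedata.category(ch),  # general category (Pd = dash punctuation)
    "is_ascii": ch.isascii(),              # False: the code point is above 127
}
print(info)

# Dash and Pd are not the same set: MINUS SIGN is a math symbol (Sm).
minus_category = unicodedata.category("\u2212")
print(minus_category)
```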

Other common code points with the Unicode Dash property include these shown by the unichars script:

 $ unichars '\p{Dash}'
 -    45 002D HYPHEN-MINUS
 ‐  8208 2010 HYPHEN
 ‑  8209 2011 NON-BREAKING HYPHEN
 ‒  8210 2012 FIGURE DASH
 –  8211 2013 EN DASH
 —  8212 2014 EM DASH
 ―  8213 2015 HORIZONTAL BAR
 ⁓  8275 2053 SWUNG DASH
 ⁻  8315 207B SUPERSCRIPT MINUS
 ₋  8331 208B SUBSCRIPT MINUS
 −  8722 2212 MINUS SIGN
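
To use these characters in a regular expression when the engine has no `\p{Dash}` support, one option is to spell the set out. The sketch below assumes Python, whose standard `re` module has no `\p{...}` property classes, and builds a character class from the code points in the listing above. The names `DASH_CODE_POINTS`, `DASH_RE` and `normalize_dashes` are invented for this example, and the list covers only the characters shown here, not the full Dash property.

```python
import re

# Code points copied from the unichars listing above (not the complete Dash set).
DASH_CODE_POINTS = [
    0x002D, 0x2010, 0x2011, 0x2012, 0x2013, 0x2014,
    0x2015, 0x2053, 0x207B, 0x208B, 0x2212,
]

# re.escape keeps HYPHEN-MINUS from acting as a range operator inside [...].
DASH_RE = re.compile(
    "[" + "".join(re.escape(chr(cp)) for cp in DASH_CODE_POINTS) + "]"
)

def normalize_dashes(text):
    """Replace every listed dash-like character with a plain HYPHEN-MINUS."""
    return DASH_RE.sub("-", text)

sample = "pages 10\u201320, well\u2010known, x \u2212 y"
print(normalize_dashes(sample))  # pages 10-20, well-known, x - y
```

The pattern is applied to decoded text, so each dash is matched as one character rather than as three separate bytes.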


It's probably Unicode. The right answer is to use Unicode throughout. You'll ultimately get in a lot of trouble if you try to treat Unicode strings as ASCII.
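
A small illustration of the kind of trouble meant here, sketched in Python under the assumption that the input arrives as UTF-8 bytes (the sample string is made up): the same pattern behaves differently on raw bytes and on decoded text, because `.` matches one byte in the first case and one character in the second.

```python
import re

raw = b"well\xe2\x80\x90known"   # UTF-8 bytes, e.g. as read from a file
text = raw.decode("utf-8")       # decode once, at the input boundary

byte_len = len(raw)    # counts bytes: the hyphen contributes 3
char_len = len(text)   # counts code points: the hyphen contributes 1

# On bytes, "." consumes a single byte, so it cannot span the 3-byte hyphen.
byte_match = re.fullmatch(rb"well.known", raw)
# On decoded text, "." consumes the whole U+2010 character.
text_match = re.fullmatch(r"well.known", text)

print(byte_len, char_len)                 # 12 10
print(byte_match is None, bool(text_match))  # True True

# Encode again only when writing the result back out.
round_trip = text.encode("utf-8")
```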
