I am working on a language segmentation project. I implemented segmentation for English by using a regular expression that breaks the string at . (full stop). Now I want to add support for the following languages: Chinese, Arabic, Japanese, Russian, Korean, Dutch, Hindi, Greek, Urdu. I want to break strings in these languages at the full stop as well.
e.g.
For Chinese, the full stop is 。 (Unicode value U+3002). String:
以有效應對各種事態」。他還表示,希望以符合21世紀的方式切實深化美日同盟關係。
Expected Result
Segment 1 :- 以有效應對各種事態」。
Segment 2 :- 他還表示,希望以符合21世紀的方式切實深化美日同盟關係。
I have to apply the same logic to the other languages (Arabic, Japanese, Russian, Korean, Dutch, Hindi, Greek, Urdu).
See String.split. You can use /([。])/ as a regular expression separator. Add the other punctuation characters inside the square brackets. The round parentheses capture your delimiters, so they are kept in the resulting array instead of being discarded.
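A minimal JavaScript sketch of this approach. The set of full stops in the comments is an assumption and is not exhaustive, and `splitSentences` is a hypothetical helper name. Because the capture group keeps each delimiter in the array returned by `split`, the pieces can be glued back together so that every segment ends with its own full stop:

```javascript
// Sentence-ending full stops (an assumed, non-exhaustive set):
//   U+002E "."  full stop used in Russian, Korean, Dutch, Greek and Arabic text
//   U+3002 "。" ideographic full stop (Chinese, Japanese)
//   U+0964 "।"  Devanagari danda (Hindi)
//   U+06D4 "۔"  Arabic full stop (Urdu)
const FULL_STOPS = /([.\u3002\u0964\u06D4])/;

// Hypothetical helper: returns the segments, each with its full stop attached.
function splitSentences(text) {
  // With a capturing group, split() returns
  // [text, stop, text, stop, ..., trailing text].
  const parts = text.split(FULL_STOPS);
  const segments = [];
  for (let i = 0; i < parts.length; i += 2) {
    // Re-attach the delimiter that followed this piece, if there was one.
    const segment = (parts[i] + (parts[i + 1] || "")).trim();
    if (segment !== "") {
      segments.push(segment);
    }
  }
  return segments;
}

const sample =
  "以有效應對各種事態」。他還表示,希望以符合21世紀的方式切實深化美日同盟關係。";
splitSentences(sample).forEach((segment, index) => {
  console.log(`Segment ${index + 1} :- ${segment}`);
});
```

For the Chinese sample from the question this prints the two expected segments. Splitting on a plain "." also breaks at decimals and abbreviations, so real-world text will probably need extra rules on top of this.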
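In PHP you might use preg_split( REGEX , $yourString );
Replace the word REGEX with your regular expression, possibly the one @janmoesen mentioned. For UTF-8 strings the pattern should probably carry the u modifier so that multi-byte characters such as 。 are matched as whole characters, and the PREG_SPLIT_DELIM_CAPTURE flag is needed if you want the captured delimiters returned.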