How do I match words in different languages

Developer https://www.devze.com 2023-03-24 21:27 Source: Internet
How can I write a regular expression to efficiently match both Swedish and English words?

I must be able to match the likes of Å, é and '. I think 123 is also a word. I even think 1:e and 1st are words...

How would I proceed if I also wish to match words from Russian and Japanese?

Thanks,

Barry

P.S. The following are not words and should not be matched:

, =HELLO=, @NEW_LINE_MARKER, can"t, hel*o, /new/

Also,

This string "Hey! What? Yes, I'm coming." should be split into:

(Hey, What, Yes, I'm, coming)


Japanese

Detecting word boundaries in Japanese (and Chinese) text requires knowledge of the language at a fluent level. These texts are written without any word separators, and nothing in the written form marks where one word ends and the next begins, so a regular expression alone cannot find the words.
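What a regular expression can still do is isolate the Japanese stretches of a mixed text, so that they can be handed to a dictionary-based segmenter (a morphological analyzer such as MeCab, not shown here). The following is a minimal sketch in Python. The name `japanese_runs` is made up for illustration, and the character ranges are an assumption that covers only Hiragana, Katakana and the main CJK Unified Ideographs block:

```python
import re

# Assumed ranges: Hiragana, Katakana, and the main CJK Unified Ideographs block.
# Rare ideographs in extension blocks are not covered by this sketch.
JAPANESE_RUN = re.compile(r"[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+")

def japanese_runs(text):
    """Return the unbroken stretches of Japanese script found in text.

    Each stretch may contain several words; splitting it further
    needs a dictionary-based segmenter, not a regex.
    """
    return JAPANESE_RUN.findall(text)
```

Note that a run such as 私は学生です comes back as a single piece even though it contains several words.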

Latin-script texts (English, Swedish) and most Cyrillic texts (Russian) can be split on whitespace and certain punctuation (period, comma, dash, but not hyphen).
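One way to apply this to the examples in the question is to split first and then validate each piece, so that tokens like =HELLO= or hel*o are rejected whole rather than partially matched. The following is a rough sketch in Python, not a definitive solution. The names `SEPARATORS`, `WORD` and `extract_words` are made up for illustration, and the choice of separators and in-word joiners (apostrophe, colon, hyphen) is an assumption drawn from the question's examples:

```python
import re
import unicodedata

# Assumed separators: whitespace, period, comma, !, ?, ;, en and em dashes.
SEPARATORS = re.compile(r"[\s.,!?;\u2013\u2014]+")

# A word is one or more runs of Unicode letters/digits ([^\W_] is \w minus
# the underscore), optionally joined by an apostrophe, colon, or hyphen.
WORD = re.compile(r"[^\W_]+(?:['’:-][^\W_]+)*")

def extract_words(text):
    """Split text on separators and keep only the pieces that are whole words."""
    # NFC turns 'e' + combining accent into a single 'é' that \w can match.
    text = unicodedata.normalize("NFC", text)
    return [tok for tok in SEPARATORS.split(text) if tok and WORD.fullmatch(tok)]

print(extract_words("Hey! What? Yes, I'm coming."))
print(extract_words("Åsa åt 123 äpplen på 1:e maj"))
print(extract_words("Привет, мир!"))
```

Because Python's `\w` is Unicode-aware for `str` patterns, the same expression covers Swedish, English and Russian. A token with other punctuation attached, such as a trailing colon or surrounding quotes, is rejected by this sketch. Whether to strip such characters first is a design choice left open here.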
