How can I write a regular expression to efficiently match both Swedish and English words?
I must be able to match the likes of Å, é and '. I think 123 is also a word. I even think 1:e and 1st are words...
How would I proceed if I also wish to match words from Russian and Japanese?
Thanks,
Barry
P.S. The following are not words and should not be matched:
, =HELLO=, @NEW_LINE_MARKER, can"t, hel*o, /new/
Also, this string "Hey! What? Yes, I'm coming." should be split into:
(Hey, What, Yes, I'm, coming)
Japanese
Detecting word boundaries in CJK text requires fluent knowledge of the language: these texts are written without any word separation, and nothing in the written form distinguishes one word from the next. More on the subject.
Roman texts (English, Swedish) and most Cyrillic texts (Russian) are divided on whitespace and certain punctuation (period, comma, dash, but not hyphen).
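For the whitespace-delimited scripts, one possible approach is sketched below in Python. It is only an illustration under a few assumptions of mine, not a definitive answer: the function name `words`, the set of punctuation stripped from token edges, and the choice to allow `'`, `:` and `-` inside a word are all my own guesses based on the examples in the question. It splits on whitespace, trims surrounding punctuation, and keeps a token only if the whole remainder looks like a word.

```python
import re
import unicodedata

# [^\W_] is "\w minus underscore": letters and digits in any script.
# A word is one or more such runs joined by a single ', : or - (assumed set).
WORD = re.compile(r"[^\W_]+(?:[':-][^\W_]+)*")

# Punctuation allowed to surround a word without being part of it (assumed set).
EDGE_PUNCTUATION = ".,!?;()\""

def words(text):
    """Return the whitespace-separated tokens of text that look like words."""
    # NFC turns "A" + combining ring into the single letter "Å",
    # so that accented letters count as word characters.
    text = unicodedata.normalize("NFC", text)
    result = []
    for token in text.split():
        core = token.strip(EDGE_PUNCTUATION)
        # Reject the token unless the pattern covers all of it.
        if core and WORD.fullmatch(core):
            result.append(core)
    return result

print(words("Hey! What? Yes, I'm coming."))
```

With these assumptions, tokens such as =HELLO=, @NEW_LINE_MARKER, can"t, hel*o and /new/ are rejected because the pattern has to cover the whole token, while 123, 1:e, 1st and Cyrillic words pass. It does nothing useful for Japanese: an unbroken run of kana and kanji comes back as a single "word", which is exactly the segmentation problem described above.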