Regular Expression: Split English and Non-English words with Comma?

Developer https://www.devze.com 2022-12-11 14:06 Source: Internet

Is there any regular expression pattern to change this string

This is a mix string of üößñ and English. üößñ üößñ are Unicode words.

to this?

This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.

Actually, I want to separate English words from non-English words with a comma.

Thanks.

javascript

/((?:\ [^\w\s]+)+)/g

'This is a mix string of üößñ and English. üößñ üößñ are Unicode words.'.replace(/((?:\ [^\w\s]+)+)/g,',$1,')

This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.

Mark

No regular expression can detect strings in a particular language, but you can certainly match characters in (or not in) a range of code points by using Unicode escapes, such as

/[\u0900-\u097F]+/

which matches a sequence of Devanagari characters.

Remember that a Script (a collection of characters) can be used by many languages.
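As a rough sketch of the same idea in newer JavaScript: engines that support Unicode property escapes (ES2018 and later, with the `u` flag) let you name the script instead of spelling out the code point range. The helper name `findDevanagariRuns` and the sample string are made up for illustration, and this assumes your runtime supports `\p{Script=...}`.

```javascript
// Hypothetical helper: collects every run of Devanagari characters in a string.
// byRange uses the explicit code point range from the answer above;
// byScript uses a Unicode script property escape (needs the `u` flag).
function findDevanagariRuns(text) {
  const byRange = /[\u0900-\u097F]+/g;
  const byScript = /\p{Script=Devanagari}+/gu;
  return {
    byRange: text.match(byRange) || [],   // match() returns null when nothing is found
    byScript: text.match(byScript) || [],
  };
}

const sample = "Hello नमस्ते world";
console.log(findDevanagariRuns(sample));
```

The two patterns are not always interchangeable: a block range covers every code point in that block, while a script property only covers characters assigned to that script, so shared punctuation may be treated differently.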

Sure, you can use \x escapes to match (or exclude) specific character code ranges.

For example (in JavaScript):

var x = "This is a mix string of üößñ and English. üößñ üößñ are Unicode characters.";
x.replace(/([^\x00-\x7F]+\s)+/g, function(match) { return match.slice(0,-1)+", "; } ); // matches runs of characters outside the ASCII range (0-127)

Output:

This is a mix string of üößñ, and English. üößñ üößñ, are Unicode characters.

I'm sure another regex savvy person can optimize further, but this is the best I can think of half-awake :)

    String s = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
    System.out.println(s.replaceAll("((?: ?[\\p{L}&&[^A-Za-z]]+)+)", ",$1,"));

Unicode defines many different scripts, but the pattern above does not distinguish between them. It simply matches any Unicode letter that is not an ASCII letter (A-Z or a-z).
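For anyone who wants the Java approach in JavaScript, here is a minimal sketch. It assumes an engine with Unicode property escapes (ES2018+); the function name `separateNonAsciiWords` is just something I made up. JavaScript has no `&&` class intersection without the newer `v` flag, so the double negative `[^\P{L}A-Za-z]` stands in for "a letter, but not an ASCII letter".

```javascript
// Hypothetical helper: wraps each run of non-ASCII-letter words in commas.
// [^\P{L}A-Za-z] = NOT (non-letter OR ASCII letter) = a letter outside A-Z/a-z.
// The optional leading space keeps neighbouring non-ASCII words in one group.
function separateNonAsciiWords(text) {
  const nonAsciiRun = /((?: ?[^\P{L}A-Za-z]+)+)/gu;
  return text.replace(nonAsciiRun, ",$1,");
}

const input = "This is a mix string of üößñ and English. üößñ üößñ are Unicode words.";
console.log(separateNonAsciiWords(input));
// This is a mix string of, üößñ, and English., üößñ üößñ, are Unicode words.
```

One caveat: the pattern works per character, not per word, so a word that mixes both kinds of letters (for example "naïve") would get commas around just the non-ASCII part.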