
Regex to extract the words from other languages

Developer https://www.devze.com 2023-03-27 12:36 Source: Internet

I know that I can extract the English letters and numbers by using the A-Za-z0-9 regex.

How can I extract the words from other languages such as Arabic and only allow the letters and numbers in their script and nothing else?

One approach I have used is to filter out everything I don't want from the text, so that I am left with just the words, but this takes a lot of CPU time and is not efficient for large-scale applications.

I was wondering what other methods are in use, or that someone knows of, for analysing text in other languages.

How can words be extracted from languages such as Chinese and Japanese, which do not even use spaces between words? One approach I took was to treat styles and line breaks as word boundaries, but this is unreliable when people don't use many line breaks or much formatting to separate words.

So, to sum up, how can other languages be analysed using regex?


In general, regular expressions are not powerful enough to extract words in languages that do not use a word separator (such as a space).

To extract words from Chinese, you need a huge dictionary of known words, and you partition a sentence according to the known words, favoring longer dictionary entries (because each character by itself is a valid word).

To extract words from Japanese, it depends on the style of writing. If the text is entirely in kana, then use the dictionary approach mentioned above. If the text is in a standard mix of kanji and kana, then you can at least know that every kana-to-kanji transition is almost surely the start of a new word.
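The kana-to-kanji heuristic can be sketched with a zero-width regex split. This is only a rough illustration: the function name is hypothetical, the character ranges are an assumption (the Hiragana and Katakana blocks for kana, the basic CJK Unified Ideographs block for kanji), and the pieces it returns are phrase-like chunks rather than dictionary words. Splitting on an empty match needs Python 3.7 or later.

```python
import re

# Assumed ranges: U+3040-U+30FF covers the Hiragana and Katakana blocks,
# U+4E00-U+9FFF covers the basic CJK Unified Ideographs block.
# The pattern matches the empty position between a kana and a following kanji.
KANA_TO_KANJI = re.compile(r"(?<=[\u3040-\u30FF])(?=[\u4E00-\u9FFF])")


def split_at_kana_to_kanji(text):
    """Split text at every point where a kana is followed by a kanji."""
    return KANA_TO_KANJI.split(text)


print(split_at_kana_to_kanji("私は東京に行きます"))
# ['私は', '東京に', '行きます']
```

Each chunk still bundles a word with its trailing particle or inflection, so a dictionary is still needed to go further.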


Suppose we have a Chinese dictionary at hand and we want to split a phrase like this: 中國是位於亞洲東部的一個廣大地域或國度

One approach is to scan from the left and grab as many characters as possible while still having a word in the dictionary. Then we move forward by that many characters and repeat. This approach, called the greedy method, would give us this splitting of the phrase: [中國][是][位於][亞洲][東部][的][一][個][廣大][地域][或][國度]
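A minimal sketch of the greedy method, assuming a tiny hand-made dictionary that contains just enough entries for this phrase (a real dictionary would be far larger). The function name and the fallback of emitting an unknown character as a one-character word are assumptions made for illustration.

```python
def greedy_segment(text, dictionary):
    """Left-to-right longest-match segmentation against a set of known words."""
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumption: a single unknown character is kept as its own word.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, just large enough for the example phrase.
TOY_DICTIONARY = {"中國", "是", "位於", "亞洲", "東部", "的",
                  "一", "個", "廣大", "地域", "或", "國度"}

print(greedy_segment("中國是位於亞洲東部的一個廣大地域或國度", TOY_DICTIONARY))
# ['中國', '是', '位於', '亞洲', '東部', '的', '一', '個', '廣大', '地域', '或', '國度']
```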

This is not the only approach, as sometimes the best split is not left-greedy. For example, if we have the dictionary {A, B, C, D, AB, BCD} and the text ABCD, then we can split the text as [AB][C][D] or as [A][BCD]. The latter split may be preferred over the former.
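One possible alternative to the left-greedy scan is dynamic programming over all dictionary-consistent splits. The sketch below picks the split with the fewest words; that criterion is an assumption chosen for illustration (real segmenters often score splits by word frequency instead), and the function name is hypothetical.

```python
def fewest_words_segment(text, dictionary):
    """Return a split of text into dictionary words using as few words as possible.

    Returns None when no split into dictionary words exists.
    """
    n = len(text)
    # best[i] is the shortest word list found so far that covers text[:i].
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[n]


print(fewest_words_segment("ABCD", {"A", "B", "C", "D", "AB", "BCD"}))
# ['A', 'BCD']
```

On the ABCD example this returns the two-word split [A][BCD], whereas a left-greedy scan would commit to AB first and end up with [AB][C][D].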

Conveniently, this web page can demonstrate the word splitting in practice: http://www.mdbg.net/chindict/chindict.php


If you just want to filter by character and not by some higher-order linguistic construct, you can do the exact same thing with most languages; you just need a regular expression library that supports Unicode. Look up the list of Unicode ranges for the script you need and filter based on those ranges.
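As a rough illustration of character-level filtering with a Unicode-aware regex engine: in Python 3, `\w` in the built-in `re` module matches letters and digits from any script (plus the underscore) when the pattern is a `str`. The function name below is hypothetical. One caveat to keep in mind is that combining marks are not treated as word characters, so scripts that rely on them may be split more finely than expected.

```python
import re

# [^\W_] means "a word character that is not an underscore",
# i.e. a letter or digit in any script.
WORD = re.compile(r"[^\W_]+")


def extract_words(text):
    """Return runs of Unicode letters and digits, dropping everything else."""
    return WORD.findall(text)


print(extract_words("Привет, мир! Hello_world 42"))
# ['Привет', 'мир', 'Hello', 'world', '42']
```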


Just as [A-Za-z0-9] can be used for English text (roughly), so [\p{Script=Arabic}0-9] can be used for Arabic text (roughly).
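Python's built-in `re` module does not understand `\p{Script=Arabic}` (the third-party `regex` package does), so the sketch below approximates it with explicit code point ranges. The ranges are an assumption: U+0621 to U+064A for the basic Arabic letters and U+0660 to U+0669 for Arabic-Indic digits. They leave out diacritics and the extended letters used by other languages written in Arabic script. The function name is hypothetical.

```python
import re

# Approximation of [\p{Script=Arabic}0-9] using explicit ranges (assumed):
#   U+0621-U+064A  basic Arabic letters
#   U+0660-U+0669  Arabic-Indic digits
#   0-9            ASCII digits
ARABIC_WORD = re.compile(r"[\u0621-\u064A\u0660-\u06690-9]+")


def extract_arabic_words(text):
    """Return runs of Arabic letters and digits, ignoring everything else."""
    return ARABIC_WORD.findall(text)


print(extract_arabic_words("Hello, مرحبا بالعالم! 2023"))
```

The Latin word and the punctuation are skipped; only the two Arabic words and the number are returned.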
