How to preg_match_all a set of words in any possible language?

开发者 https://www.devze.com 2023-04-02 04:47 Source: the web

Related topics: php regex utf-8

I have a website that people enter lists of words into.

These lists of words could be written in any language in the world.

How can I extract these lists of words from their input data if I do not know what language they are entering?

Is there some kind of match-all international alphabet symbol I am missing, or do I have to manually write up a set of brackets that will match every possible international letter?

Is this what I am looking for and just don't know it yet?

You can use Unicode character properties, for example:

preg_match_all('#[\p{L}\p{Pc}]+#u', $str, $matches);

[\p{L}\p{Pc}]+ matches runs of letters and connector punctuation (such as the underscore). If you only want letters, you can shorten that to \pL+.
Either way, you'd want to define "word" more precisely. It is probably more than just a sequence of letters...
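
To make the behaviour of that pattern concrete, here is a minimal sketch. It assumes the input is UTF-8; the helper name extract_words and the sample string are illustrative and not part of the original answer.

```php
<?php
// Hypothetical helper: return every run of letters / connector punctuation.
function extract_words(string $str): array
{
    // \p{L}  = any letter in any script
    // \p{Pc} = connector punctuation, e.g. "_"
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8.
    $count = preg_match_all('#[\p{L}\p{Pc}]+#u', $str, $matches);
    if ($count === false) {
        // Most commonly caused by malformed UTF-8 in the subject.
        throw new InvalidArgumentException(
            'preg_match_all failed, error code ' . preg_last_error()
        );
    }
    return $matches[0];
}

$words = extract_words("hello, мир! 你好 snake_case 42");
// $words is ["hello", "мир", "你好", "snake_case"]
```

Note that the digits are dropped (they are not letters) and that 你好 comes back as a single run: the pattern finds letter sequences, it does not segment scripts that are written without spaces between words.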

My recommendation is to define your own input convention: force users to enter one word at a time, or one word per line in a text box. Otherwise, you will need a segmentation algorithm for each script. Granted, for the vast majority of scripts it will be something trivial like "split on characters that Unicode classifies as separators or whitespace", but the remaining special cases are basically still open AI research topics.
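
Both options from this answer can be sketched in a few lines. This is only an illustration under stated assumptions: the input is UTF-8 text from a form field, and the function names words_from_lines and split_on_separators are hypothetical.

```php
<?php
// Convention 1 (hypothetical helper): one word per line of a text box.
function words_from_lines(string $input): array
{
    // \R matches any newline sequence (\n, \r\n, \r, ...).
    $lines = preg_split('/\R/u', $input);
    if ($lines === false) {
        throw new InvalidArgumentException('Could not split input; is it valid UTF-8?');
    }
    $words = [];
    foreach ($lines as $line) {
        // Strip leading/trailing whitespace and Unicode separators (\p{Z}).
        $word = preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $line);
        if ($word !== null && $word !== '') {
            $words[] = $word;
        }
    }
    return $words;
}

// Convention 2 (hypothetical helper): the "trivial" segmentation for
// scripts that put separators between words.
function split_on_separators(string $input): array
{
    $parts = preg_split('/[\p{Z}\s]+/u', $input, -1, PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        throw new InvalidArgumentException('Could not split input; is it valid UTF-8?');
    }
    return $parts;
}
```

The one-word-per-line version never has to guess where a word ends, so it works the same for every script. The separator-based version only helps for languages that actually write spaces between words.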