I'm trying to match kanji compounds in a Japanese sentence using regex.
Right now, I'm using / ((.)*) /
to match a space delimited compound in, for example, 彼はそこに ひと人 でいた。
The problem is that in some sentences the word is at the beginning, or is followed by a punctuation character. Ex. いっ瞬 の間が生まれた。
or 一昨じつ、彼らはそこを出発した。
I've tried something like / ((.)*) |^((.)*) | ((.)*)、 etc.
But this matches 彼はそこに ひと人
instead of ひと人
in 彼はそこに ひと人 でいた。
Is there any way to pack all this in a single regex, or do I have to use one, check whether it returned anything, then try another one if not?
Thanks!
P.S.: I'm using PHP to parse the sentences.
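For reference, the behaviour described in the question can be reproduced with the short sketch below. It is written in Python rather than PHP, and the helper name `first_match` is made up for illustration. The engine tries each alternative at position 0 first; the `^((.)*) ` branch succeeds there, and the greedy `(.)*` runs up to the last space in the sentence, which is why the match covers both 彼はそこに and ひと人.

```python
import re

# The alternation from the question, unchanged apart from Python syntax.
PATTERN = re.compile(r' ((.)*) |^((.)*) | ((.)*)、')

def first_match(sentence):
    """Return the full text of the first match, or None if nothing matches."""
    m = PATTERN.search(sentence)
    return m.group(0) if m else None

sentence = '彼はそこに ひと人 でいた。'
# The match starts at the beginning of the sentence, not at ひと人.
print(repr(first_match(sentence)))
```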
Assuming your input is in UTF-8 you could try with
'/(\pL+)/u'
The \pL+
matches a run of one or more Unicode letters in the string.
Example:
$str = '彼はそこに ひと人 でいた。';
preg_match_all('/(\pL+)/u', $str, $matches);
var_dump($matches[0]);
Output:
array(3) {
[0]=>
string(15) "彼はそこに"
[1]=>
string(9) "ひと人"
[2]=>
string(9) "でいた"
}
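The same idea can be sketched in Python, with one caveat: the built-in `re` module does not support `\pL`, so this sketch substitutes `[^\W\d_]` (a word character that is neither a digit nor an underscore) as an approximation. The function name `letter_runs` is hypothetical.

```python
import re

# [^\W\d_] approximates PCRE's \pL: word characters minus digits and "_".
LETTER_RUN = re.compile(r'[^\W\d_]+')

def letter_runs(sentence):
    """Return every maximal run of letters in the sentence."""
    return LETTER_RUN.findall(sentence)

for s in ('彼はそこに ひと人 でいた。', '一昨じつ、彼らはそこを出発した。'):
    print(letter_runs(s))
```

Note that this returns every run of letters, so picking the compound out of the list is still up to the caller.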
I think this: /([^ 、]+)/u
should match the words in the examples you've given. You may want to add other word-terminating characters besides the space and 、 if they occur in your texts, or use \pL
instead of [^ 、]
to cover all Unicode letters.
EXAMPLE
<?php
preg_match_all('/[^ 、]+/u', "彼らは日本の 国民 となった。", $m);
print_r($m);
outputs
Array
(
[0] => Array
(
[0] => 彼らは日本の
[1] => 国民
[2] => となった。
)
)
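As a rough illustration of adding more terminators, here is a Python sketch of the same negated character class with the full stop 。 added to the excluded set. The set of terminators and the function name `chunks` are assumptions; real text will probably need more punctuation listed.

```python
import re

# Anything that is not a space, 、 or 。 counts as part of a chunk.
CHUNK = re.compile(r'[^ 、。]+')

def chunks(sentence):
    """Return the runs of characters between the listed terminators."""
    return CHUNK.findall(sentence)

for s in ('彼らは日本の 国民 となった。', 'いっ瞬 の間が生まれた。'):
    print(chunks(s))
```

With 。 excluded, the last chunk no longer carries the trailing full stop.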
You're only trying to split your string on some pattern (whitespace or punctuation), is that right? What about this?
In [51]: word = '.test test\n.test'
In [53]: re.split(r'[\s,.]+', word)
Out[53]: ['', 'test', 'test', 'test']
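Applied to the Japanese sentences from the question, the split approach might look like the sketch below. The delimiter set (whitespace, 、 and 。) and the name `split_sentence` are assumptions. Because `re.split` returns an empty string when the input starts or ends with a delimiter, the sketch filters those out.

```python
import re

# Split on runs of whitespace, the comma 、 or the full stop 。.
DELIMITERS = re.compile(r'[\s、。]+')

def split_sentence(sentence):
    """Split on the delimiters and drop the empty leading/trailing pieces."""
    return [part for part in DELIMITERS.split(sentence) if part]

for s in ('一昨じつ、彼らはそこを出発した。', '彼はそこに ひと人 でいた。'):
    print(split_sentence(s))
```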
After thinking about it for a long time, I believe there's no way to parse the compounds without delimiting all of them with spaces or some other character, which is what I'm doing now :)
Ex. if the sentence is 私は ノート、ペンなどが必要だ。
, there is no way for the computer to know whether it's 私は
(start of sentence & space delimited) or ノート
(space & comma delimited) that it should choose.
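The ambiguity can be made concrete with a small Python sketch. It folds the three cases into one pattern using lookarounds: the left edge is the start of the string or a preceding space, and the right edge is a following space or 、. The pattern and the name `candidates` are illustrative assumptions, not something taken from the answers above.

```python
import re

# Left edge: start of string or a preceding space.
# Right edge: a following space or the comma 、.
CANDIDATE = re.compile(r'(?:^|(?<= ))[^ 、。]+(?= |、)')

def candidates(sentence):
    """Return every substring that satisfies the combined delimiter rule."""
    return CANDIDATE.findall(sentence)

# Both 私は and ノート qualify, so the regex alone cannot pick one.
print(candidates('私は ノート、ペンなどが必要だ。'))
# Unambiguous cases: only one candidate each.
print(candidates('いっ瞬 の間が生まれた。'))
print(candidates('一昨じつ、彼らはそこを出発した。'))
```

A single regex can therefore cover all three cases, but whenever it returns more than one candidate, something outside the regex has to decide which one is the compound.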
Thanks everyone for your suggestions...