
Implementing SRX Segmentation Rules in JavaScript

Developer https://www.devze.com 2022-12-29 04:03 Source: web

I want to implement the SRX segmentation rules in JavaScript to extract sentences from text.

To do this correctly I have to follow the SRX rules, e.g. http://www.lisa.org/fileadmin/standards/srx20.html#refTR29

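There are two types of rules, each defined by regular expressions:

  1. break rules: if the pattern is found, the sentence should break, e.g. at ". "
  2. no-break rules: if the pattern is found, the sentence should not break, e.g. after an abbreviation such as U.K. or Mr.

Each rule in turn has two parts:

  1. beforebreak: the pattern for the text before the break position
  2. afterbreak: the pattern for the text after the break position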

For example, if the rule is

<rule break="no">

    <beforebreak>\s*[0-9]+\.</beforebreak>
    <afterbreak>\s</afterbreak>

</rule>

This says that if the pattern "\s*[0-9]+\.\s" is found, the segment should not break there.

How do I implement this in JavaScript? Maybe the split function alone is not enough?


You may want to try something like this:

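function segment(text, rules) {
    if (!text) return [];       // nothing to split
    if (!rules) return [text];  // no rules: the whole text is one segment

    // Matches one <rule> element: group 1 is set for break="no",
    // groups 2 and 3 capture the beforebreak and afterbreak patterns.
    var rulePattern = /<rule(?:(\s+break="no")|\s+[^>]+|\s*)>(?:<beforebreak>([^<]+)<\/beforebreak>)?(?:<afterbreak>([^<]+)<\/afterbreak>)?<\/rule>/g;

    // replace() is used here only to visit the rules in document order.
    cleanXml(rules).replace(rulePattern, 
        function(whole, nobreak, before, after) {
            // Match "before" only where no marker has been placed yet and "after" follows.
            // "before" is wrapped in a group so that alternations (a|b) keep their meaning.
            var r = new RegExp('(?:'+(before||'')+')(?![\uE000\uE001])'+(after?'(?='+after+')':''), 'mg');
            // Append a marker: \uE000 means "no break", \uE001 means "break".
            text = text.replace(r, nobreak ? '$&\uE000' : '$&\uE001');
            return '';
        }
    );

    // Drop the no-break markers, then split at the break markers.
    var sentences = text.replace(/\uE000/g, '').split(/\uE001/g);

    return sentences;
}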

function cleanXml(s) {
    return s && s.replace(/<!--[\s\S]*?-->/g,'').replace(/>\s+</g,'><');
}

To run this simply call segment() with the text to split, and the rules XML as a string. For example:

// Backslashes are doubled because the rules are written as JavaScript string literals.
segment('The U.K. Prime Minister, Mr. Blair, was seen out with his family today.',
        '<rule break="no">' +
            '<beforebreak>\\sMr\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="no">' +
            '<beforebreak>\\sU\\.K\\.</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>' +
        '<rule break="yes">' +
            '<beforebreak>[\\.\\?!]+</beforebreak>' +
            '<afterbreak>\\s</afterbreak>' +
        '</rule>'
);

The call to segment() will return an array of sentences, so you can simply do something like alert(segment(...).join('\n')) to see the result.

Known Limitations:

  1. It expects to receive the rules that result from the cascading process for the specific language.
  2. It expects the regular expressions used by the rules to conform to JavaScript regexp syntax.
  3. It does not handle internal markup.

All of these limitations seem quite easy to overcome.
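
For the first limitation, selecting the rules for a language could look roughly like the sketch below. It assumes the SRX file has already been parsed into plain objects; languageRules, languageMaps and rulesFor are hypothetical names, not part of SRX or of the code above. It also assumes that a languagepattern has to match the whole language code, and that with cascading turned on every matching languagemap contributes its rules in order, while without it only the first match is used.

```javascript
// Hypothetical, already-parsed SRX data (not produced by segment()).
var languageRules = {
    English: [
        { breaks: false, before: '\\sMr\\.', after: '\\s' }
    ],
    Default: [
        { breaks: true, before: '[.?!]+', after: '\\s' }
    ]
};
var languageMaps = [
    { pattern: 'en.*', ruleName: 'English' },
    { pattern: '.*', ruleName: 'Default' }
];

// Collect the rules for a language code, in languagemap order.
// With cascade on, every matching map contributes; otherwise only the first.
function rulesFor(langCode, cascade) {
    var result = [];
    for (var i = 0; i < languageMaps.length; i++) {
        var map = languageMaps[i];
        // Assumption: the pattern must match the whole language code.
        if (new RegExp('^(?:' + map.pattern + ')$').test(langCode)) {
            result = result.concat(languageRules[map.ruleName] || []);
            if (!cascade) break;
        }
    }
    return result;
}

var enRules = rulesFor('en-GB', true);   // English rules, then Default rules
var frRules = rulesFor('fr', true);      // Default rules only
```

The resulting list keeps the exception rules ahead of the general break rule, which is the order segment() relies on. To feed it to segment() it would still have to be serialized back into rule XML, or segment() adapted to accept such a list.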

How does this work?

The segment function uses rulePattern to extract each rule, identify whether it is a breaking or a non-breaking rule, and create a regexp based on the beforebreak and afterbreak clauses of the rule. It then scans the text and marks each matching position by adding a Unicode character (taken from a Unicode private use area) that indicates whether it is a break (\uE001) or a non-break (\uE000). If another marker is already present at the same position, the rule does not match there, which preserves rule priorities.

Then it simply removes the non-break marks, and splits the text according to the break marks.
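
The marking steps can be followed on a small input without any XML involved. This is only an illustrative sketch: mark is a hypothetical helper and the two rules are hard-coded, but the markers and the order of operations are the same as in segment().

```javascript
// Private-use characters used as temporary markers (same choice as segment()).
var NO_BREAK = '\uE000';
var BREAK = '\uE001';

// Hypothetical helper: append `marker` after every match of `before`
// that is followed by `after` and has not been marked already.
function mark(text, before, after, marker) {
    var r = new RegExp(
        '(?:' + before + ')(?![' + NO_BREAK + BREAK + '])(?=' + after + ')', 'g');
    return text.replace(r, '$&' + marker);
}

var text = 'Mr. Smith arrived. He was late.';

// Rules are applied in priority order: the exception first, then the break rule.
text = mark(text, '\\bMr\\.', '\\s', NO_BREAK);
// text is now 'Mr.\uE000 Smith arrived. He was late.'
text = mark(text, '[.?!]+', '\\s', BREAK);
// text is now 'Mr.\uE000 Smith arrived.\uE001 He was late.'

// Remove the no-break markers and split at the break markers.
var sentences = text.split(NO_BREAK).join('').split(BREAK);
// sentences is ['Mr. Smith arrived.', ' He was late.']
```

The break rule skips the period in "Mr." because a marker already follows it. Note that the second sentence keeps its leading space, since the split happens right after the period; trim the segments if that matters.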

@Sourabh: I hope this is still relevant for you.
