Stanford Core NLP - understanding coreference resolution

Developer https://www.devze.com 2023-03-16 23:34 Source: Internet
I'm having some trouble understanding the changes made to the coref resolver in the latest version of the Stanford NLP tools. As an example, below is a sentence and the corresponding CorefChainAnnotation:

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.

{1=[1 1, 1 2], 5=[1 3], 7=[1 4], 9=[1 5]}

I am not sure I understand the meaning of these numbers. Looking at the source doesn't really help either.

Thank you


I've been working with the coreference dependency graph, and I started by using the other answer to this question. After a while, though, I realized that the algorithm in that answer is not exactly correct: the output it produces is not even close to what my modified version gives.

For anyone else who uses this article, here is the algorithm I ended up with. It also filters out self-references, because every representative mention also mentions itself and a lot of mentions only reference themselves.

Map<Integer, CorefChain> coref = document.get(CorefChainAnnotation.class);

for(Map.Entry<Integer, CorefChain> entry : coref.entrySet()) {
    CorefChain c = entry.getValue();

    //this is because it prints out a lot of self references which aren't that useful
    if(c.getCorefMentions().size() <= 1)
        continue;

    CorefMention cm = c.getRepresentativeMention();
    String clust = "";
    List<CoreLabel> tks = document.get(SentencesAnnotation.class).get(cm.sentNum-1).get(TokensAnnotation.class);
    for(int i = cm.startIndex-1; i < cm.endIndex-1; i++)
        clust += tks.get(i).get(TextAnnotation.class) + " ";
    clust = clust.trim();
    System.out.println("representative mention: \"" + clust + "\" is mentioned by:");

    for(CorefMention m : c.getCorefMentions()){
        String clust2 = "";
        tks = document.get(SentencesAnnotation.class).get(m.sentNum-1).get(TokensAnnotation.class);
        for(int i = m.startIndex-1; i < m.endIndex-1; i++)
            clust2 += tks.get(i).get(TextAnnotation.class) + " ";
        clust2 = clust2.trim();
        //don't need the self mention
        if(clust.equals(clust2))
            continue;

        System.out.println("\t" + clust2);
    }
}

And the final output for your example sentence is the following:

representative mention: "a basic unit of matter" is mentioned by:
    The atom
    it

Usually "The atom" ends up being the representative mention, but surprisingly in this case it doesn't. Another example, with slightly more accurate output, is the following sentence:

The Revolutionary War occurred during the 1700s and it was the first war in the United States.

produces the following output:

representative mention: "The Revolutionary War" is mentioned by:
    it
    the first war in the United States
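
The loops above rely on an indexing convention that is easy to miss: startIndex and endIndex count tokens from 1 within the sentence, and endIndex points one past the last token of the mention. Below is a minimal, library-free sketch of that arithmetic. The token list and the three spans are hand-derived for the second example sentence and are assumptions, as are the names MentionSpanSketch, spanText and otherMentions. It also skips the self-mention by comparing positions rather than text, which is one possible alternative to the string comparison used above (two different mentions with the same text, such as two occurrences of "it", would otherwise both be dropped).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class MentionSpanSketch {

    // Joins the tokens of a span given as a 1-based start (inclusive) and a
    // 1-based end (exclusive), mirroring the loop bounds used above.
    static String spanText(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    // Returns the text of every mention except the representative one,
    // comparing spans by position instead of by surface text.
    static List<String> otherMentions(List<String> tokens, int[][] spans, int[] representative) {
        List<String> result = new ArrayList<>();
        for (int[] span : spans) {
            if (Arrays.equals(span, representative)) {
                continue; // skip the self-mention
            }
            result.add(spanText(tokens, span[0], span[1]));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-tokenized example sentence (assumed tokenization).
        List<String> tokens = Arrays.asList(
            "The", "Revolutionary", "War", "occurred", "during", "the", "1700s",
            "and", "it", "was", "the", "first", "war", "in", "the", "United", "States", ".");

        // Assumed {startIndex, endIndex} spans for the three mentions.
        int[][] spans = { {1, 4}, {9, 10}, {11, 18} };
        int[] representative = spans[0];

        System.out.println("representative mention: \""
            + spanText(tokens, representative[0], representative[1]) + "\" is mentioned by:");
        for (String mention : otherMentions(tokens, spans, representative)) {
            System.out.println("\t" + mention);
        }
    }
}
```

With these assumed spans, the sketch prints the same three lines as the output shown above for this sentence.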


The first number is a cluster id (a cluster groups the mentions that stand for the same entity); see the source code of SieveCoreferenceSystem#coref(Document). The number pairs are the output of CorefChain#toString():

public String toString(){
    return position.toString();
}

where position is a set of position pairs, one per mention of the entity (to get the mentions themselves, use CorefChain.getCorefMentions()). Here is a complete code example (in Groovy), which shows how to get from positions to tokens:

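// Package names below assume a dcoref-era CoreNLP release; adjust them for your version.
import edu.stanford.nlp.dcoref.CorefChain
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class Example {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        props.put("dcoref.score", true);
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // The input text; it is reused below when printing the mentions.
        String aText = "The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.";
        Annotation document = new Annotation(aText);

        pipeline.annotate(document);
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);

        println aText

        // One entry per cluster: key = cluster id, value = the coreference chain.
        for (Map.Entry<Integer, CorefChain> entry : graph.entrySet()) {
            CorefChain c = entry.getValue();
            println "ClusterId: " + entry.getKey();
            CorefMention cm = c.getRepresentativeMention();
            // startIndex/endIndex are passed straight to subSequence here.
            println "Representative Mention: " + aText.subSequence(cm.startIndex, cm.endIndex);

            List<CorefMention> cms = c.getCorefMentions();
            println "Mentions:  ";
            cms.each { it ->
                print aText.subSequence(it.startIndex, it.endIndex) + "|";
            }
        }
    }
}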

Output (I do not understand where 's' comes from):

The atom is a basic unit of matter, it consists of a dense central nucleus surrounded by a cloud of negatively charged electrons.
ClusterId: 1
Representative Mention: he
Mentions: he|atom |s|
ClusterId: 6
Representative Mention:  basic unit 
Mentions:  basic unit |
ClusterId: 8
Representative Mention:  unit 
Mentions:  unit |
ClusterId: 10
Representative Mention: it 
Mentions: it |
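
A likely explanation for the stray 's': startIndex and endIndex appear to be 1-based token positions within the sentence (with an exclusive end), whereas subSequence expects character offsets. The sketch below reproduces the first cluster's fragments under that assumption. The shortened text, the token list and the spans (1-3, 4-9, 10-11) are hand-derived rather than taken from the library, and the class and method names are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;

final class OffsetMixupSketch {

    // Treats the indices as character offsets into the raw text.
    static String asCharOffsets(String text, int startIndex, int endIndex) {
        return text.subSequence(startIndex, endIndex).toString();
    }

    // Treats the indices as 1-based token positions with an exclusive end.
    static String asTokenSpan(List<String> tokens, int startIndex, int endIndex) {
        return String.join(" ", tokens.subList(startIndex - 1, endIndex - 1));
    }

    public static void main(String[] args) {
        // Only the beginning of the example sentence is needed for cluster 1.
        String text = "The atom is a basic unit of matter, it consists";

        // Hand-tokenized version of the same text (assumed tokenization).
        List<String> tokens = Arrays.asList(
            "The", "atom", "is", "a", "basic", "unit", "of", "matter", ",", "it", "consists");

        // Assumed {startIndex, endIndex} pairs for the three mentions in cluster 1.
        int[][] spans = { {1, 3}, {4, 9}, {10, 11} };

        for (int[] span : spans) {
            System.out.println("[" + asCharOffsets(text, span[0], span[1]) + "] vs ["
                + asTokenSpan(tokens, span[0], span[1]) + "]");
        }
    }
}
```

Read as character offsets, the three spans give "he", "atom " and "s", matching the output above; read as token positions, they give "The atom", "a basic unit of matter" and "it". The 's' is simply the character at offset 10 of the text, the last letter of "is".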


These are the recent results from the annotator.

  1. [1, 1] 1 The atom
  2. [1, 2] 1 a basic unit of matter
  3. [1, 3] 1 it
  4. [1, 6] 6 negatively charged electrons
  5. [1, 5] 5 a cloud of negatively charged electrons

The markings are as follows :

[Sentence number,'id']  Cluster_no  Text_Associated

Text belonging to the same cluster refers to the same entity.
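
Read this way, the listing can be regrouped by cluster number, which gives a map of the same shape as the CorefChainAnnotation in the question (cluster id to mentions). Here is a small sketch that uses only the five rows above as data; the Row class and groupByCluster are hypothetical names for illustration, not part of the Stanford API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ClusterGroupingSketch {

    // One line of the listing: [sentence, id] cluster text.
    static final class Row {
        final int sentence;
        final int id;
        final int cluster;
        final String text;

        Row(int sentence, int id, int cluster, String text) {
            this.sentence = sentence;
            this.id = id;
            this.cluster = cluster;
            this.text = text;
        }
    }

    // Groups mention texts by cluster number, keeping clusters sorted by id.
    static Map<Integer, List<String>> groupByCluster(List<Row> rows) {
        Map<Integer, List<String>> chains = new TreeMap<>();
        for (Row row : rows) {
            chains.computeIfAbsent(row.cluster, k -> new ArrayList<>()).add(row.text);
        }
        return chains;
    }

    public static void main(String[] args) {
        // The five rows from the listing above.
        List<Row> rows = Arrays.asList(
            new Row(1, 1, 1, "The atom"),
            new Row(1, 2, 1, "a basic unit of matter"),
            new Row(1, 3, 1, "it"),
            new Row(1, 6, 6, "negatively charged electrons"),
            new Row(1, 5, 5, "a cloud of negatively charged electrons"));

        for (Map.Entry<Integer, List<String>> chain : groupByCluster(rows).entrySet()) {
            System.out.println("cluster " + chain.getKey() + ": " + chain.getValue());
        }
    }
}
```

Cluster 1 ends up with three mentions ("The atom", "a basic unit of matter", "it"), while clusters 5 and 6 each contain a single mention.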
