Let's say there is a sentence:
On March 1, he was born.
Changing it to
He was born on March 1.
doesn't change the meaning of the sentence, and the result is still valid. Shuffling the words in any other way would produce sentences that range from odd to invalid. So basically, I'm talking about parts of a sentence that make the information more specific, but whose removal doesn't break the sentence as a whole. Is there an NLP library that can identify such parts?
Constituents
It sounds like you want to identify the sentence's constituents, which are groups of words that operate as a single unit according to the grammar of a language.
In fact, when linguists are trying to discover a language's grammar, they do it in part by looking at movement. As in your example, this is where a group of words can be moved to a different position in a sentence while still preserving the meaning of the sentence.
Constituents can be individual words, phrases, or even larger groups such as whole clauses. Within a sentence, they have a nested hierarchical structure. For instance, the first example sentence you gave could be analyzed as:
(S (PP (IN On) (NP (NNP March) (CD 1)))
(NP (PRP he))
(VP (VBD was) (VP (VBN born))))
The whole sentence is made up of a prepositional phrase, followed by a noun phrase, and then a verb phrase. The prepositional phrase can be further decomposed into a unit consisting of the single word 'On' followed by a noun phrase.
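As a rough illustration of how this nested structure can be walked programmatically, here is a minimal sketch in plain Python. It assumes you already have the bracketed parse as a string (for example, copied from a parser's output); the helper names `parse_tree`, `words`, and `constituents` are made up for this example and are not part of any parser's API.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1                      # skip "("
        label = tokens[pos]           # e.g. S, NP, VBD
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())       # nested constituent
            else:
                children.append(tokens[pos])  # a leaf word
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def words(node):
    """Return the words covered by a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    _, children = node
    return [w for child in children for w in words(child)]


def constituents(node):
    """Yield (label, phrase) for every labeled node, top-down."""
    if isinstance(node, str):
        return
    _, children = node
    yield node[0], " ".join(words(node))
    for child in children:
        yield from constituents(child)


# The parse of the example sentence given above.
PARSE = """
(S (PP (IN On) (NP (NNP March) (CD 1)))
   (NP (PRP he))
   (VP (VBD was) (VP (VBN born))))
"""

tree = parse_tree(PARSE)
for label, phrase in constituents(tree):
    print(f"{label:4} {phrase}")
```

Each printed line is one constituent: the whole sentence under S, then 'On March 1' under PP, and so on down to the individual words with their part-of-speech tags.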
Phrase Structure Parsers
To find constituents automatically, you will probably want to use a phrase structure parser. There are many such parsers to choose from that are available as open source, including:
- Stanford Parser (Java)
- Berkeley Parser (Java)
- BLLIP (Charniak-Johnson) Parser (C++)
- Bikel Parser (a reimplemented and improved version of the Collins parser, written in Java)
- Collins Parser (C++)
- OpenNLP Parser (Java)
- SharpNLP Parser (C#)
The Stanford and Berkeley parsers are probably the easiest to install and use. As seen in Cer et al. 2010, the most accurate parsers are Berkeley and Charniak. The Bikel parser is slower and less accurate than the others.
Online Demo
There's an online demo for the Stanford parser here. I used the demo to produce the parse given above of your example sentence.
A Note About Deletion
Within each constituent, there will be a head word. For example, take the noun phrase:
(NP (DT The) (JJ big) (JJ blue) (NN ball))
The head word here is the noun 'ball', and it is modified by the adjectives 'big' and 'blue'. If this noun phrase was embedded in a sentence, you could delete those modifiers and still have something that was consistent with, but less specific than, the meaning of the original sentence.
Within noun phrases, you can generally delete the adjectives, nouns that are not the head, and nested prepositional phrases.
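A minimal sketch of this kind of deletion, in plain Python, might look like the following. It is a simplification: it only drops adjectives (JJ) and prepositional phrases (PP) that sit directly inside a noun phrase, and it does not attempt to find non-head nouns, which would need head-finding rules. The helper names and the second example sentence are made up for illustration.

```python
def parse_tree(text):
    """Parse a bracketed parse string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def words(node):
    """Return the words covered by a node, in sentence order."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]


def strip_modifiers(node, droppable=("JJ", "PP")):
    """Return a copy of the tree with droppable children removed from NPs."""
    if isinstance(node, str):
        return node
    label, children = node
    if label == "NP":
        # Keep leaf words and any child whose label is not droppable.
        children = [c for c in children
                    if isinstance(c, str) or c[0] not in droppable]
    return (label, [strip_modifiers(c, droppable) for c in children])


def simplify(parse):
    """Parse, strip NP modifiers, and return the remaining words as a string."""
    return " ".join(words(strip_modifiers(parse_tree(parse))))


# The noun phrase from the text above.
print(simplify("(NP (DT The) (JJ big) (JJ blue) (NN ball))"))

# A hypothetical sentence with a prepositional phrase nested in the subject NP.
print(simplify(
    "(S (NP (NP (DT The) (JJ blue) (NN ball))"
    " (PP (IN on) (NP (DT the) (NN table))))"
    " (VP (VBD rolled)))"
))
```

Restricting the deletion to children of NP nodes is deliberate: a PP attached elsewhere, such as one that is an argument of the verb, is left alone.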
Within verb phrases and complete clauses, things get trickier, since deleting material that serves as an argument to the verb can completely change the interpretation of a sentence. For example, deleting 'the book' from 'He sold Jim the book' results in 'He sold Jim'.
OpenNLP may do some of this for you. Phrase chunking and parsing should help you with this. However, this is not a particularly simple problem, and algorithms will tend to get confused as sentence structure becomes more complex and ambiguous. You should sometimes be able to reorder phrases within a sentence and maintain meaning.