Detect Proper Nouns with WordNet?

I'm using JAWS to access WordNet. Given a word, is there any way to detect if it is a proper noun? It looks like the synsets have pretty coarse lexical categories.

To clarify, there is no context for the words - they are just presented individually. If a word could conceivably be used as a common noun, it is acceptable. So "mark" is fine, because although it could be someone's name it could also refer to a point. However, "Africa" is not.

Unfortunately, you're not going to be able to reliably determine proper noun information from WordNet synsets. What you are looking for is Named Entity Recognition. The Wikipedia page on the topic links to several implementations available in Java. I would personally recommend Stanford NER or LingPipe.

Updated:

Based on the added constraint of no context for words, you could use capitalization as the primary indicator and then double check WordNet to see if the word can be used as a noun. Perhaps something like this:

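import edu.smu.tw.wordnet.Synset;
import edu.smu.tw.wordnet.SynsetType;
import edu.smu.tw.wordnet.WordNetDatabase;

// Assumes the "wordnet.database.dir" system property points at the WordNet dict directory.
String word = "foo";
boolean isProperNoun = false;
// Only capitalized words are candidates; then confirm WordNet lists a noun sense.
if (!word.isEmpty() && Character.isUpperCase(word.charAt(0))) {
    WordNetDatabase database = WordNetDatabase.getFileInstance();
    Synset[] synsets = database.getSynsets(word, SynsetType.NOUN);
    isProperNoun = synsets.length > 0;
}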

That would eliminate false positives like this:

If you build it...
As you wish...
Oh Romeo, Romeo...

And still catch just the capitalized nouns in

In the Book of Mark it says...
Have you heard The Roots or The Who recently?

but still give you false positives on

Mark the first instance...
Book 'em, Danno.

because those words could be proper nouns, but without context you can't tell.

If you wanted to get really tricky, you could follow up the hypernym tree on any noun to see if you reached something obvious like 'company' or 'country'. However, the last time I was working with WordNet (4 years ago), the hypernym/hyponym relationships were not very reliable or consistent, which could cause a lot of false negatives (and without improving the false positives I mentioned above because those are completely context dependent).
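To make the hypernym idea concrete, here is a minimal sketch of the upward walk. It does not call WordNet at all: the parent links live in a plain in-memory map, and the class name, the marker set, and the toy data are all made up for illustration. With JAWS the parents would presumably come from NounSynset.getHypernyms() and getInstanceHypernyms(), but treat that wiring as an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HypernymWalk {
    // Hypothetical marker categories; pick whatever counts as "obvious" for your data.
    static final Set<String> MARKERS = Set.of("country", "company");

    /** Returns true if any ancestor of start, following parent links upward, is a marker. */
    static boolean reachesMarker(String start, Map<String, List<String>> parents) {
        Deque<String> queue = new ArrayDeque<>();
        Set<String> seen = new HashSet<>(); // guards against cycles in inconsistent data
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            for (String parent : parents.getOrDefault(current, List.of())) {
                if (MARKERS.contains(parent)) {
                    return true;
                }
                if (seen.add(parent)) {
                    queue.add(parent);
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy parent links, not real WordNet content.
        Map<String, List<String>> parents = Map.of(
            "France", List.of("country"),
            "country", List.of("region"),
            "fly", List.of("insect"),
            "insect", List.of("animal"));
        System.out.println(reachesMarker("France", parents));
        System.out.println(reachesMarker("fly", parents));
    }
}
```

Note that a common noun such as "kingdom" may also have "country" among its ancestors, so this check probably only makes sense in combination with the capitalization test.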

If you use WordNet from the Linux command line, you can run 'wn <word> -synsn' to list the noun synsets of a word. The proper nouns will be capitalized. E.g.,

$: wn mark -synsn

   Synonyms/Hypernyms (Ordered by Estimated Frequency) of noun mark
   15 senses of mark                                                       

   Sense 1
   mark, grade, score
         => evaluation, valuation, rating
   .
   .
   .
   Sense 8
   Mark, Saint Mark, St. Mark
         INSTANCE OF=> Apostle, Apostelic Father
         INSTANCE OF=> Evangelist
         INSTANCE OF=> saint

But, seriously, please don't rely only on WordNet for this. There are potentially gazillions of proper nouns for which WordNet will not fetch you any information. Try the name Henrik, for example!

You can, however, build a context for your word w from datasets like the Google n-gram corpus, and use such contexts to build a classifier that returns a confidence score (i.e., the classifier can say w is a proper noun with confidence c, where 0 <= c <= 1).
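As a rough illustration of what such a score could look like, the sketch below swaps the trained classifier for something much cruder: the fraction of corpus occurrences in which the word is capitalized. The class name and the two count parameters are hypothetical; the counts would have to come from your own pass over an n-gram dataset, ideally ignoring sentence-initial positions.

```java
class CapitalizationConfidence {
    /**
     * Fraction of corpus occurrences in which the word appears capitalized.
     * Returns 0.5 (no evidence either way) when the word was never seen.
     */
    static double confidence(long capitalizedCount, long lowercaseCount) {
        if (capitalizedCount < 0 || lowercaseCount < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        long total = capitalizedCount + lowercaseCount;
        if (total == 0) {
            return 0.5;
        }
        return (double) capitalizedCount / total;
    }

    public static void main(String[] args) {
        // Made-up counts, purely for illustration.
        System.out.println(confidence(90, 10));
        System.out.println(confidence(0, 0));
    }
}
```

A real classifier would use the neighbouring words as features as well; this ratio is only the simplest baseline that still yields a value between 0 and 1.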

Let me run this past you: one cannot determine a word's part of speech out of context. Reading a few more books on English grammar should make it clear why.

The best you could do is test for exclusion ... determining that WordNet knows of no usage in a given part of speech. In some cases you might find that only one part of speech is listed in WordNet. For example, I know of no usage of "car" other than as a noun.

Distinguishing proper nouns from common ones is even more difficult. Certainly you can use the heuristic ... a noun which is not the initial word of a sentence and is capitalized but not in ALLCAPS is probably a proper noun.
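A minimal sketch of that heuristic, assuming the sentence has already been split into tokens (the class and method names are made up). It only looks at position and letter case; it does not check that the token is actually a noun, which would need WordNet or a tagger on top.

```java
import java.util.Locale;

class ProperNounHeuristic {
    /**
     * tokens: the words of one sentence, in order; index: position of the word to test.
     * Returns true for a capitalized, non-initial word that is not written in all caps.
     */
    static boolean looksLikeProperNoun(String[] tokens, int index) {
        String word = tokens[index];
        if (index == 0 || word.isEmpty()) {
            return false; // sentence-initial words are capitalized regardless
        }
        boolean startsUpper = Character.isUpperCase(word.charAt(0));
        // Single capital letters such as "I" also count as all caps here.
        boolean allCaps = word.equals(word.toUpperCase(Locale.ROOT));
        return startsUpper && !allCaps;
    }

    public static void main(String[] args) {
        String[] sentence = {"In", "the", "Book", "of", "Mark"};
        for (int i = 0; i < sentence.length; i++) {
            System.out.println(sentence[i] + ": " + looksLikeProperNoun(sentence, i));
        }
    }
}
```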

Ultimately, the distinction is one of semantics rather than lexical analysis. I doubt you'll find a reasonably robust solution based on looking up words in WordNet. I think you'll need to do natural language grammatical parsing before you'll be able to reliably extract nouns, much less detect proper nouns in prose.

That information doesn't seem to be specially stored in WordNet. You can, however, look at the first word form of a noun synset to see if it's capitalized. Not sure how official that is, but it seems to work: it reports that "fly" is not a proper noun and that "France" is.
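Here is a small sketch of that check, written against plain string arrays so it runs without WordNet installed. Each inner array stands for the word forms of one noun synset; with JAWS these would presumably come from Synset.getWordForms() on each result of getSynsets(word, SynsetType.NOUN), but that wiring is an assumption, and the class and method names are made up. To match the question's rule, a word is rejected only if every noun sense starts with a capitalized form.

```java
class SynsetCaseCheck {
    /**
     * senses[i] holds the word forms of the i-th noun synset, first form first.
     * Returns true only if there is at least one sense and every sense's first
     * word form starts with an uppercase letter.
     */
    static boolean isProperNounOnly(String[][] senses) {
        if (senses.length == 0) {
            return false; // word unknown as a noun: no evidence either way
        }
        for (String[] forms : senses) {
            if (forms.length == 0 || forms[0].isEmpty()
                    || !Character.isUpperCase(forms[0].charAt(0))) {
                return false; // at least one common-noun sense exists
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Word forms typed in by hand for illustration.
        String[][] mark = {{"mark", "grade", "score"}, {"Mark", "Saint Mark", "St. Mark"}};
        String[][] france = {{"France", "French Republic"}};
        System.out.println(isProperNounOnly(mark));
        System.out.println(isProperNounOnly(france));
    }
}
```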