Can I extract generic entities with LingPipe, other than People, Org and Loc?

Developer https://www.devze.com 2023-04-07 10:44 Source: Internet
I have read through LingPipe for NLP and found that it can identify mentions of names of people, locations and organizations. My question is: if I have a training set of documents that contain mentions of, say, software projects, can I use this training set to train a named entity recognizer? Once training is complete, I should be able to feed a test set of text documents to the trained model and identify mentions of software projects in them.

Is this kind of generic NER possible with LingPipe? If so, which features should I feed to the model?

Thanks, Abhishek S


Provided that you have enough training data in which software projects are tagged, that would be possible.
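To make "tagged training data" concrete, here is a minimal sketch of one common token-level encoding (BIO tags). The label name `PROJECT`, the helper `spans_to_bio` and the example sentence are hypothetical, and this is not necessarily the input format LingPipe expects; check the tutorial for the format its trainers read.

```python
def spans_to_bio(tokens, spans, label="PROJECT"):
    """Convert entity spans given as (start, end_exclusive) token indices to BIO tags.

    The label name is an assumption made for this illustration.
    """
    tags = ["O"] * len(tokens)          # "O" marks tokens outside any entity
    for start, end in spans:
        tags[start] = "B-" + label      # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I-" + label      # remaining tokens of the mention
    return tags


tokens = ["We", "index", "documents", "with", "Apache", "Lucene", "today"]
spans = [(4, 6)]  # "Apache Lucene" is annotated as a software project

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```

Each annotated sentence in the training set would then carry one tag per token, which is what a sequence model learns to predict.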

If you are using LingPipe, I would try a character n-gram model first for this task. Such models are simple and usually do the job. If the results are not good enough, some of the standard NER features are:
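As a rough illustration of what "character n-grams" means, the sketch below lists the substrings such a model would see for a single token. It only shows the idea: LingPipe builds its character language models internally, and the helper name `char_ngrams` and the boundary markers `^` and `$` are assumptions made here.

```python
def char_ngrams(text, n_max=3):
    """Return all character n-grams of length 1..n_max, with boundary markers."""
    padded = "^" + text + "$"           # mark token start and end (an assumption)
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams.append(padded[i:i + n])
    return grams


print(char_ngrams("Ant", n_max=2))
```

Because the n-grams capture prefixes, suffixes and casing patterns, they can pick up regularities in project names without any hand-written features.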

  • tokens
  • part of speech (POS)
  • capitalization
  • punctuation
  • character signatures (word shapes), in a long form and a collapsed form, for example: (LUCENE -> AAAAAA -> A), (Lucene -> Aaaaaa -> Aa), (Lucene-core -> Aaaaaa-aaaa -> Aa-a)
  • a gazetteer (list of known software projects) may also be useful, if you can compile one from Wikipedia, SourceForge or any internal resource.
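The character signatures and the gazetteer lookup from the list above can be sketched as follows. The function names, the mapping of digits to `0` and the tiny gazetteer are assumptions made for illustration; a real gazetteer would be loaded from a much larger list.

```python
from itertools import groupby


def long_shape(token):
    """Map each character to a class: uppercase -> A, lowercase -> a, digit -> 0."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")             # digit class is an assumption, not from the answer
        else:
            out.append(ch)              # keep punctuation such as "-" unchanged
    return "".join(out)


def short_shape(token):
    """Collapse runs of the same class, e.g. Aaaaaa-aaaa -> Aa-a."""
    return "".join(key for key, _ in groupby(long_shape(token)))


# Hypothetical gazetteer; in practice this would come from an external list.
GAZETTEER = {"lucene", "hadoop"}


def in_gazetteer(token):
    """Case-insensitive membership test against the gazetteer."""
    return token.lower() in GAZETTEER


for word in ["LUCENE", "Lucene", "Lucene-core"]:
    print(word, long_shape(word), short_shape(word), in_gazetteer(word))
```

The collapsed form generalizes better across tokens of different lengths, while the long form keeps length information, so it is common to use both as separate features.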

Finally, for each token you could add contextual features: the tokens before the current one (t-1, t-2, ...), the tokens after it (t+1, t+2, ...), and bigram combinations of those neighbours, such as (t-2^t-1) and (t+1^t+2).
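A minimal sketch of these contextual features, assuming a window of two tokens on each side and a hypothetical `<PAD>` placeholder for positions outside the sentence (the function and feature names are illustrative, not part of any library):

```python
def context_features(tokens, i, window=2):
    """Build neighbour-token features for position i, plus two bigram features."""
    def tok(j):
        # Positions outside the sentence get a padding symbol (an assumption).
        return tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    feats = {"t0": tok(i)}
    for k in range(1, window + 1):
        feats[f"t-{k}"] = tok(i - k)    # tokens before the current one
        feats[f"t+{k}"] = tok(i + k)    # tokens after the current one
    # Bigram combinations of the two preceding and the two following tokens.
    feats["t-2^t-1"] = tok(i - 2) + "^" + tok(i - 1)
    feats["t+1^t+2"] = tok(i + 1) + "^" + tok(i + 2)
    return feats


tokens = ["We", "use", "Lucene", "for", "search"]
print(context_features(tokens, 2))
```

In practice the same window would also be applied to the other per-token features (POS, shape), not only to the raw tokens.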


Of course you can. Just get training data covering all the categories you need and follow the tutorial at http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html. No feature tuning is required, since LingPipe uses only hardcoded features (shapes, word sequences and n-grams).
