Java keyword extraction

Developer https://www.devze.com 2023-03-04 18:02 Source: web

Is there a simple-to-use Java library that can take a String and return a set of Strings containing its keywords/keyphrases?

It doesn't have to be particularly clever; it just needs to use stop words and stemming to match keywords.

I am looking at the KEA package http://code.google.com/p/kea-algorithm/ but I can't figure out how to use their code.

Ideally something simple which has a little example documentation would be good. In the meantime I will set about writing this myself!

EDIT: When I say I can't figure out how to use their code, I mean I can't see a simple way to do it. The individual classes by themselves have useful methods that will do much of the work.

This is a fairly old question and the OP has probably already solved the problem, but I'm putting this here for others who may stumble upon the question while looking for how to use KEA.

For KEA, you will need a training set - some of your documents will need to have keywords already set. The training data consists of a directory of documents (.txt files) and corresponding keywords files (.key files), with one keyword per line. You train KEA on this set, then use the model to extract keywords on the rest of your documents, which are in another directory of .txt files. KEA will write out corresponding .key files in this directory.
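To make the directory layout concrete, here is a minimal sketch in plain Java that writes a document and its keyword file side by side. It does not call KEA at all; the class name `KeaCorpusLayout`, the helper names, and the file name `doc1` are made up for illustration, and the only thing taken from the description above is the .txt/.key pairing with one keyword per line. Check the KEA README for any further requirements on the format.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of KEA): prepares directories in the
// .txt/.key layout described above.
class KeaCorpusLayout {

    /** Writes name.txt (document text) and name.key (one keyword per line) into dir. */
    public static void writeTrainingPair(Path dir, String name, String text,
                                         List<String> keywords) throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(name + ".txt"), text.getBytes(StandardCharsets.UTF_8));
        // Files.write with a list of strings writes one element per line.
        Files.write(dir.resolve(name + ".key"), keywords, StandardCharsets.UTF_8);
    }

    /** Writes only name.txt; the matching .key file is left for the extractor to create. */
    public static void writeUnlabeledDoc(Path dir, String name, String text)
            throws IOException {
        Files.createDirectories(dir);
        Files.write(dir.resolve(name + ".txt"), text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory is used here so the sketch has no lasting side effects.
        Path root = Files.createTempDirectory("kea-corpus");
        Path train = root.resolve("train");
        Path test = root.resolve("test");

        writeTrainingPair(train, "doc1",
                "Stemming and stop words are common steps in keyword extraction.",
                Arrays.asList("stemming", "stop words", "keyword extraction"));
        writeUnlabeledDoc(test, "doc2", "Another document that has no keywords yet.");

        System.out.println("Training directory: " + train);
        System.out.println("Extraction directory: " + test);
    }
}
```

The training directory would then be the one you point the model builder at, and the second directory the one you run the extractor over.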

For more information, take a look at one or more of the following:

1) The KEA source distribution has a TestKEA.java class which shows how to extract keywords from a small test corpus. The README has details on the directory format required.

2) This blog post has (somewhat terse, IMO) instructions on how to use KEA.

http://kea-pranay.blogspot.com/2010/02/kea-key-extraction-algorithm.html

3) My blog post, which I wrote up last weekend while trying to learn how to generate keywords from a corpus I had (which was already manually annotated with keywords). It has Python code to pre-process the data into the format KEA expects, Scala code (KEA provides a Java API) to train and run the extractor, and Python code to analyze and visualize the generated keywords.

http://sujitpal.blogspot.com/2014/08/keyword-extraction-with-kea.html

You might try the Porter Stemming algorithm: the Java version is at http://tartarus.org/~martin/PorterStemmer/java.txt and the main page is at http://tartarus.org/~martin/PorterStemmer/. It's old, but it doesn't do a bad job.
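If you do end up writing it yourself, the stop-word-plus-stemming approach from the question could look roughly like the sketch below. Everything in it is an assumption for illustration: the class name `SimpleKeywordExtractor`, the tiny stop-word list, and especially `stem`, which is a crude suffix stripper standing in for a real stemmer and is not the Porter algorithm. You would swap in the Porter implementation linked above at that point.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: ranks stemmed, non-stop-word tokens by frequency.
class SimpleKeywordExtractor {

    // Tiny illustrative stop-word list; a real list would be much longer.
    private static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "in",
            "is", "it", "of", "on", "or", "that", "the", "this", "to", "was", "with"));

    // Placeholder stemmer: strips a few common suffixes. NOT the Porter algorithm.
    public static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    /** Returns up to limit stems, most frequent first (ties keep first-seen order). */
    public static Set<String> extract(String text, int limit) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue;
            }
            counts.merge(stem(token), 1, Integer::sum);
        }

        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts stay in insertion order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));

        Set<String> keywords = new LinkedHashSet<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (keywords.size() == limit) {
                break;
            }
            keywords.add(entry.getKey());
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(extract("The cats are chasing the cat and the dog", 2));
    }
}
```

Note that this returns stems rather than the original surface forms, and it only handles single words. Keyphrases would need an extra step, for example counting runs of adjacent non-stop-word tokens.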