Open source tools for text clustering and auto summarization [closed]

Developer (https://www.devze.com), 2023-02-14 17:29. Source: the web

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.

We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.

Closed 7 years ago.

My latest project requires measuring similarities among text documents and giving each of them a short title. Is there an open source library for these tasks? Or, if I must build it myself, is there a tutorial on the subject? What tools should I use?

To measure similarity between text documents, you can start with older techniques based on document vector similarity (see the vector space model). Latent semantic indexing can be used for the same purpose. Here is one paper on document similarities.
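As a rough illustration of the vector space model, here is a minimal pure-Python sketch that weights terms with TF-IDF and compares documents by cosine similarity. The function names (`tokenize`, `tfidf_vectors`, `cosine`), the simple regex tokenizer, and the smoothed IDF formula are assumptions made for this example, not something prescribed by any particular library.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption of this sketch) so no weight becomes zero.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "the cat sat on the mat",
    "a cat sat on a mat",
    "stock markets fell sharply today",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # the two cat sentences score higher
print(cosine(vecs[0], vecs[2]))  # no shared terms, so the score is 0.0
```

Pairwise scores like these can then be fed into a clustering step. Latent semantic indexing would add a dimensionality reduction on top of the same term-document weights, which this sketch does not attempt.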

Text summarization is more difficult than similarity measurement, because you have to produce something meaningful to humans. OpenNLP is a good library for the basics of text processing. More papers related to text summarization are here; they may be a good place to start.
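To give a feel for why summarization is harder, here is a deliberately naive extractive sketch: it scores each sentence by the average frequency of its non-stopword terms and returns the top-scoring sentences. This is a simple frequency heuristic chosen for illustration, not how OpenNLP or the papers mentioned above work; the stopword list, the regex sentence splitter, and the `summarize` function are all assumptions of this example.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "on", "is", "was",
    "it", "for", "with", "as", "by", "at", "this", "that",
}


def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def split_sentences(text):
    """Very rough sentence splitter: break after ., ! or ? followed by space."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        if not words:
            return 0.0
        # Average term frequency, so long sentences are not favoured by length.
        return sum(freq[w] for w in words) / len(words)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)


text = ("Lucene indexes documents. Lucene ranks documents against a query. "
        "The weather was nice.")
print(summarize(text))
```

A heuristic like this only selects existing sentences. Producing a genuinely short, readable title usually needs more linguistic processing than word counts.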

You can also measure similarity using one of the edit distance functions. Implementations are available for popular languages if you search for them, for example a C# Levenshtein distance.
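For reference, the Levenshtein distance itself is short to write. The sketch below uses the standard two-row dynamic programming formulation; the `similarity` helper, which normalises the distance into a 0 to 1 score, is an assumption added for this example. Edit distance works character by character, so it tends to suit short strings such as titles better than whole documents.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical helper: map the distance to a score between 0 and 1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(similarity("kitten", "sitting"))
```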

Similarity between documents can also be treated as an information retrieval problem, for which Lucene is a popular library. Lucene uses the vector space model to determine similarity between a document and a query, and it can also be used to measure similarity between two documents. There are implementations in Java and C#, and ports to other languages as well.

The problem can also be approached as one of natural language processing; among the libraries I have used are NLTK and LingPipe. These libraries are aimed at much more than similarity, have a steep learning curve, and may be overkill. However, they could be helpful in extracting a short title for a document.