Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this questionQuestion is maybe ( about 100%) subjective but I need advices. What is best language for natural language processing ? I know Java and C++ but is there easier way to do it. To be more specific I need to process texts from lot of sites and get information.
As I said in comments, the question is not about a language, but about suitable library. And there are a lot of NLP libraries in both Java and C++. I believe you must inspect some of them (in both languages) and then, when you will know all the plenty of available libraries, create some kind of "big plan", how to implement your task. So, here I'll just give you some links with a brief explanation what is what.
Java
GATE - it is exactly what its name means - General Architecture for Text Processing. Application in GATE is a pipeline. You put language processing resources like tokenizers, POS-taggers, morphological analyzers, etc. on it and run the process. The result is represented as a set of annotations - meta information, attached to a peace of text (e.g. token). In addition to great number of plugins (including plugins for integration with other NLP resources like WordNet or Stanford Parser), it has many predefined dictionaries (cities, names, etc.) and its own regex-like language JAPE. GATE comes with its own IDE (GATE Developer), where you can try your pipeline setup, and then save it and load from Java code.
UIMA - or Unstructured Information Management Applications. It is very similar to GATE in terms of architecture. It also represents pipeline and produces set of annotations. Like GATE, it has visual IDE, where you can try out your future application. The difference is that UIMA mostly concerns information extraction while GATE performs text processing without explicit consideration of its purpose. Also UIMA comes with simple REST server.
OpenNLP - they call themselves organization center for open source projects on NLP, and this is the most appropriate definition. Main direction of development is to use machine learning algorithms for the most general NLP tasks like part-of-speech tagging, named entity recognition, coreference resolution and so on. It also has good integration with UIMA, so its tools are also available.
Stanford NLP - probably best choice for engineers and researchers with NLP and ML knowledge. Unlike libraries like GATE and UIMA, it doesn't aim to provide as much tools as possible, but instead concentrates on idiomatic models. E.g. you don't have comprehensive dictionaries, but you can train probabilistic algorithm to create it! In addition to its CoreNLP component, that provides most wildly used tools like tokenization, POS tagging, NER, etc., it has several very interesting subprojects. E.g. their Dependency framework allows you to extract complete sentence structure. That is, you can, for example, easily extract information about subject and object of a verb in question, which is much harder using other NLP tools.
C++
UIMA - yes, there are complete implementations for both Java and C++.
Stanford Parser - some Stanford's projects are only in Java, others - only in C++, and some of them are available in both languages. You can find many of them here.
APIs
A number of web service APIs perform specific language processing, including:
Alchemy API - language identification, named entity recognition, sentiment analysis and much more! Take a look at their main page - it is quite self-descriptive.
OpenCalais - this service tries to build giant graph of everything. You pass it a web page URL and it enriches this page text with found entities, together with relations between them. For example, you pass it a page with "Steve Jobs" and it returns "Apple Inc." (roughly speaking) together with probability that this is the same Steve Jobs.
Other recommendations
And yes, you should definitely take a look at Python's NLTK. It is not only a powerful and easy-to-use NLP library, but also a part of excellent scientific stack created by extremely friendly community.
Update (2017-11-15): 7 years later there are even more impressive tools, cool algorithms and interesting tasks. One comprehensive description may be found here:
https://tomassetti.me/guide-natural-language-processing/
Python and NLTK
ScalaNLP, which is a Natural Language Processing library written in Scala, seems suitable for your job.
I would recommend Python and NLTK.
Some hints and notes I can pinpoint based on my experience using it:
- Python has efficient list, strings handling. You can index lists very efficiently what in natural language should be a fact. Also has nice syntactic delicacies, for example to access the first 100 words of a list, you can index as list[:100] (compare it with stl in c++).
- Python serialization is easy and native. The serialization modules make language processing corpus and text handling an easy task, one line of code.(compare it with the several lines using Boost or other libraries of C++)
- NLTK provides classes for loading corpus, processing it, tagging, tokenization, grammars parsing, chunking, and a whole set of machine learning algorithms, among other stuff. Also it provides good resources for probabilistic models based on words distribution in text. http://www.nltk.org/book
If learning a new programming language is an obstacle, you can check openNLP for Java http://incubator.apache.org/opennlp/
精彩评论