
Can NLTK/pyNLTK work "per language" (i.e. non-English), and how?

Developer (https://www.devze.com), 2022-12-12 05:44, source: web

How can I tell NLTK to treat the text in a particular language?

Once in a while I write a specialized NLP routine to do POS tagging, tokenizing, etc. on a non-English (but still Indo-European) text domain.

This question seems to address only different corpora, not a change in code/settings: POS tagging in German

Alternatively, are there any specialized Hebrew/Spanish/Polish NLP modules for Python?


I'm not sure what you're referring to as the changes in code/settings. NLTK mostly relies on machine learning and the "settings" are usually extracted from the training data.

When it comes to POS tagging, the results and the tag set will depend on the tagger you use or train. Should you train your own, you'll of course need some Spanish/Polish training data. This may be hard to find because little gold-standard material is publicly available. There are tools out there that do this, but this one isn't for Python (http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/).
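To make the "settings come from the training data" point concrete, here is a minimal sketch of training a tagger for another language with NLTK. The three Spanish sentences and their tags are a made-up toy corpus, not a real resource, and the choice of "NOUN" as the fallback tag is only an assumption for illustration; a usable tagger would need a large annotated corpus.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Toy, hand-made training data (hypothetical, for illustration only).
# Each sentence is a list of (word, tag) pairs in the target language.
train_sents = [
    [("el", "DET"), ("perro", "NOUN"), ("come", "VERB")],
    [("la", "DET"), ("gata", "NOUN"), ("duerme", "VERB")],
    [("el", "DET"), ("niño", "NOUN"), ("come", "VERB"), ("pan", "NOUN")],
]

# Words never seen in training fall back to a default tag.
fallback = DefaultTagger("NOUN")

# The unigram tagger learns the most frequent tag for each word.
tagger = UnigramTagger(train_sents, backoff=fallback)

tagged = tagger.tag(["la", "gata", "come", "queso"])
print(tagged)
```

Nothing in this code is specific to Spanish: swapping in tagged sentences from another language changes the tagger's behaviour without touching any setting.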

The nltk.tokenize.punkt.PunktSentenceTokenizer tokenizer splits text into sentences using a multilingual sentence-boundary detection method, the details of which can be found in this paper (http://www.mitpressjournals.org/doi/abs/10.1162/coli.2006.32.4.485).
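As a rough illustration of how Punkt can be adapted to a language, the sketch below gives the tokenizer a couple of Spanish abbreviations by hand. The abbreviation list and the sample text are assumptions made up for this example; in practice Punkt learns such parameters from raw text in the target language, without supervision.

```python
from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer

text = "El Sr. García llegó tarde. La Dra. López ya estaba allí."

# Without any language-specific parameters, every period followed by
# whitespace is treated as a sentence boundary.
default_tokenizer = PunktSentenceTokenizer()
default_sentences = default_tokenizer.tokenize(text)

# Hand-picked abbreviations (illustrative only): lowercase, no final period.
params = PunktParameters()
params.abbrev_types.update({"sr", "dra"})

# Passing a parameters object instead of training text uses it directly.
spanish_tokenizer = PunktSentenceTokenizer(params)
spanish_sentences = spanish_tokenizer.tokenize(text)

for sentence in spanish_sentences:
    print(sentence)
```

If the pre-trained Punkt models have been downloaded, `nltk.sent_tokenize(text, language="spanish")` selects a model by language name instead.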
