
How can I create my own corpus in the Python Natural Language Toolkit? [duplicate]

Developer https://www.devze.com 2022-12-19 03:32 Source: web
This question already has answers here: Creating a new corpus with NLTK (4 answers) Closed 9 years ago.

I have recently expanded the names corpus in nltk and would like to know how I can turn the two files I have (male.txt, female.txt) into a corpus so I can access them using the existing nltk.corpus methods. Does anyone have any suggestions?

Many thanks, James.


As the README says, the names corpus is not in the public domain, so you should email any changes you make to the corpus author (the address is in that file). Apart from that detail of law and courtesy, you can simply replace either or both of those files with your own. They are in a perfectly simple format: one name per line, with comments allowed (and ignored) on lines that start with '#'.
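To make the file format concrete, here is a minimal sketch that writes a small name list with a comment line and reads it back through WordListCorpusReader. The sample names and the temporary directory are made up for illustration. It assumes an NLTK version whose words() method accepts ignore_lines_startswith; passing '#' explicitly avoids depending on whatever the default behaviour is in your installed version.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Hypothetical sample data: one name per line, plus a '#' comment line.
sample = "# extra names added by hand\nAlice\nBeatrice\n"

# Write the sample into a throwaway directory that acts as the corpus root.
corpus_dir = tempfile.mkdtemp()
with open(os.path.join(corpus_dir, "female.txt"), "w", encoding="utf8") as f:
    f.write(sample)

reader = WordListCorpusReader(corpus_dir, ["female.txt"])

# Ask the reader to skip comment lines explicitly rather than relying on defaults.
names = reader.words("female.txt", ignore_lines_startswith="#")
print(names)
```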

To install a totally new corpus rather than just tweaking an existing one, you could start with the docs given here.


I came to understand how corpus reading works by looking at the source code in nltk.corpus and then looking at the corpora themselves (located in /home/[user]/nltk_data/corpora/names; this will probably be in My Documents on XP and somewhere under the user folder on Windows 7).
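If you are not sure where the data lives on your machine, a rough sketch like the following lists the places NLTK searches. nltk.data.path is the real search list; the helper name candidate_corpus_dirs is made up for this example, and the corpus may also be stored as a zip file rather than an unpacked directory.

```python
import os

import nltk


def candidate_corpus_dirs(corpus_name):
    """Return <data dir>/corpora/<corpus_name> for every NLTK data directory."""
    return [os.path.join(base, "corpora", corpus_name) for base in nltk.data.path]


# Report which of the candidate locations actually exist on this machine.
for path in candidate_corpus_dirs("names"):
    status = "found" if os.path.isdir(path) else "missing"
    print(f"{status}: {path}")
```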

The structure of the corpus and its related function will give a good understanding of how to use the different corpora available in NLTK.

In my case I looked at the names variable in the nltk.corpus source code and was interested in the WordListCorpusReader class, as the names corpus is simply a list of words.


Alex is right: start with the docs, and figure out which corpus reader will work for your corpus. Then simply instantiate it, given the path to your corpus file(s). As you'll see in the docs, the built-in corpora are simply instances of particular corpus reader classes. Looking through the code in the nltk.corpus package should be helpful as well.
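For the male.txt / female.txt case from the question, instantiating a reader might look roughly like this. It is a sketch, not the only way to do it: the directory is a temporary one created for the demo, the names in the files are placeholders, and my_names is just an illustrative variable name. In practice you would point the reader at the folder holding your real files.

```python
import os
import tempfile

from nltk.corpus.reader import WordListCorpusReader

# Placeholder data standing in for the expanded name lists.
sample_files = {
    "male.txt": ["James", "Alex"],
    "female.txt": ["Alice", "Mary"],
}

# Create a demo corpus root; replace this with the folder containing your files.
corpus_root = tempfile.mkdtemp()
for filename, names in sample_files.items():
    with open(os.path.join(corpus_root, filename), "w", encoding="utf8") as f:
        f.write("\n".join(names) + "\n")

# The reader takes the root directory and the file ids it should expose.
my_names = WordListCorpusReader(corpus_root, ["male.txt", "female.txt"])

print(my_names.fileids())
print(my_names.words("male.txt"))

# Build (name, label) pairs, using the file name without ".txt" as the label.
labeled = [
    (name, os.path.splitext(fileid)[0])
    for fileid in my_names.fileids()
    for name in my_names.words(fileid)
]
print(labeled)
```

The resulting object supports the same fileids() and words() calls as nltk.corpus.names, so code written against the built-in corpus should need little change.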
