
How to build an IMS Open Corpus Workbench (CWB) corpus and an NLTK-readable corpus?

Developer https://www.devze.com 2023-02-12 07:25 Source: web


Related topics: corpus nltk python

Currently I have a bunch of .txt files. Within each .txt file, each sentence is on its own line. How do I convert them to the IMS CWB format so that CWB can read them? And also to an NLTK-readable format?

Can someone point me to a how-to page or guide for this? I've tried reading through the manual, but I don't really understand it: www.cwb.sourceforge.net/files/CWB_Encoding_Tutorial.pdf

Does it mean I create a data directory and a registry directory, then run the cwb-encode command, and everything will be converted to a .vrt file? Does it convert one file at a time? How do I script it to run over multiple files in a directory?

It's easy to produce CWB's "verticalized" (.vrt) format from an NLTK-readable corpus:

from nltk.corpus import brown

# Write one token per line, wrapping each sentence in <s>...</s> tags.
with open('corpus.vrt', 'w', encoding='utf-8') as out:
    for sentence in brown.sents():
        print('<s>', file=out)
        for word in sentence:
            print(word, file=out)
        print('</s>', file=out)

From there, you can follow the instructions on the CWB website.
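
For the situation in the question (a directory of .txt files with one sentence per line), the same idea can be scripted without NLTK. The following is only a rough sketch under some assumptions: the function name `txt_dir_to_vrt`, the `<text id="...">` wrapper per file, and the naive whitespace tokenization are my own choices, not something CWB requires. Tokens are XML-escaped so that a word beginning with `<` is not mistaken for a tag.

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr


def txt_dir_to_vrt(txt_dir, vrt_path):
    """Write every .txt file in txt_dir into a single verticalized file.

    Returns the number of sentences written. (Hypothetical helper.)
    """
    n_sentences = 0
    with open(vrt_path, "w", encoding="utf-8") as out:
        for txt_file in sorted(Path(txt_dir).glob("*.txt")):
            # One <text> region per input file; the id attribute is an
            # illustrative choice so sentences can be traced to their file.
            out.write(f"<text id={quoteattr(txt_file.stem)}>\n")
            with open(txt_file, encoding="utf-8") as src:
                for line in src:
                    # Assumption: tokens are separated by whitespace.
                    tokens = line.split()
                    if not tokens:
                        continue  # skip blank lines
                    out.write("<s>\n")
                    for token in tokens:
                        # One token per line, with &, < and > escaped.
                        out.write(escape(token) + "\n")
                    out.write("</s>\n")
                    n_sentences += 1
            out.write("</text>\n")
    return n_sentences
```

Calling `txt_dir_to_vrt("my_txt_files", "corpus.vrt")` (both paths are placeholders) should give one .vrt file covering the whole directory. As far as I understand the tutorial, cwb-encode then takes that .vrt file as its input and builds the indexed corpus in the data directory, rather than producing the .vrt itself, so check there for the exact options to declare the `s` and `text` attributes. For the NLTK side, `PlaintextCorpusReader` can read the same directory of .txt files; passing a line-based sentence tokenizer such as `nltk.tokenize.LineTokenizer` as `sent_tokenizer` should make it treat each line as one sentence.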