How to create a bag of words using Weka?

Devze https://www.devze.com 2023-04-12 02:10 Source: web
I have a corpus of documents and I want to represent each document as a vector. Basically, the vector would have 1 for words that are present inside a document and for other words (which are present in other documents in the corpus and not in this particular document) it would have a 0. How do I create this vector for all the documents in Weka?

Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing if possible before it creates this vector.

Thanks Abhishek S


You want the StringToWordVector filter.

It has options for binary occurrence (1/0 for word presence rather than counts) and stopword removal, among many others, such as stemming, truncating the word list, discarding infrequent terms, and case folding.
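To make the target representation concrete, here is a minimal plain-Java sketch of a binary bag of words with lower-casing and stopword removal. It does not call Weka and is not how StringToWordVector is implemented; the class name, method names, and the tiny stopword list are all made up for illustration. It only shows the kind of 1/0 vector the question describes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: hypothetical class, not part of Weka.
class BinaryBagOfWords {
    // Assumed toy stopword list; a real list would be much longer.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "a", "is", "on", "and"));

    // Lower-case the text, split on non-alphanumerics, drop stopwords.
    static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String t : doc.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // The vocabulary is every surviving word in the corpus, sorted.
    static List<String> buildVocabulary(List<String> docs) {
        Set<String> vocab = new TreeSet<>();
        for (String d : docs) {
            vocab.addAll(tokenize(d));
        }
        return new ArrayList<>(vocab);
    }

    // 1 if the vocabulary word occurs in the document, 0 otherwise.
    static int[] vectorize(String doc, List<String> vocab) {
        Set<String> present = new HashSet<>(tokenize(doc));
        int[] v = new int[vocab.size()];
        for (int i = 0; i < vocab.size(); i++) {
            v[i] = present.contains(vocab.get(i)) ? 1 : 0;
        }
        return v;
    }

    public static void main(String[] args) {
        List<String> corpus = Arrays.asList(
                "The cat sat on the mat",
                "A dog is on the mat");
        List<String> vocab = buildVocabulary(corpus);
        System.out.println(vocab);
        for (String d : corpus) {
            System.out.println(Arrays.toString(vectorize(d, vocab)));
        }
    }
}
```

In Weka the filter does this work for you (building the word list, tokenizing, and emitting one attribute per word), so a sketch like this is only useful for checking that the filter's output matches what you expect.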
