I have a corpus of documents and I want to represent each document as a vector. Basically, the vector would have a 1 for each word that is present in the document, and a 0 for every other word (words that appear in other documents in the corpus but not in this particular document). How do I create this vector for all the documents in Weka?
Is there a quick way to do this using Weka? I also want Weka to remove stopwords and do some pre-processing, if possible, before it creates this vector.
Thanks Abhishek S
You want the StringToWordVector filter.
It has options for binary occurrence (1/0 instead of word counts) and stopword removal, amongst many others, such as stemming, truncating the word list, discarding infrequent terms, and case folding.
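To make the target representation concrete, here is a minimal plain-Java sketch of what "binary occurrence with stopword removal and case folding" produces. This is not Weka's implementation and does not call Weka at all. The class name `BinaryWordVectors`, the tiny stopword list, and the two sample documents are made up for illustration, and the tokenizer is a naive split on non-letters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

public class BinaryWordVectors {
    // Hypothetical stopword list, for illustration only.
    static final Set<String> STOPWORDS = Set.of("the", "a", "is", "on", "and");

    // Case-fold, split on anything that is not a-z, and drop stopwords.
    public static List<String> tokenize(String doc) {
        List<String> tokens = new ArrayList<>();
        for (String t : doc.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Alphabetically sorted vocabulary built from every document in the corpus.
    public static List<String> vocabulary(List<String> corpus) {
        TreeSet<String> vocab = new TreeSet<>();
        for (String doc : corpus) {
            vocab.addAll(tokenize(doc));
        }
        return new ArrayList<>(vocab);
    }

    // One slot per vocabulary word: 1 if the word occurs in the document, else 0.
    public static int[] toBinaryVector(String doc, List<String> vocab) {
        Set<String> present = new HashSet<>(tokenize(doc));
        int[] vector = new int[vocab.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = present.contains(vocab.get(i)) ? 1 : 0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> corpus = List.of("The cat sat on the mat", "A dog is on the mat");
        List<String> vocab = vocabulary(corpus);
        System.out.println(vocab);
        for (String doc : corpus) {
            System.out.println(Arrays.toString(toBinaryVector(doc, vocab)));
        }
    }
}
```

In Weka itself you would not write any of this by hand; the filter builds the vocabulary and the per-document attributes for you, and the sketch only shows the shape of the result to expect.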