
Writing a Simple Triplet Matrix to a File?

开发者 https://www.devze.com 2023-01-07 18:08 Source: web
I am using the tm package to compute a term-document matrix for a dataset. I now need to write that matrix to a file, but when I use R's write functions I get an error.

Here is the code which I am using and the error I am getting:

library(tm)
data("crude")
tdm <- TermDocumentMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))
dtm <- DocumentTermMatrix(crude, control = list(weighting = weightTfIdf, stopwords = TRUE))

And this is the error I get when I call write.table on this data:

Error in cat(list(...), file, sep, fill, labels, append) : argument 1 (type 'list') cannot be handled by 'cat'

I understand that tdm is an object of type Simple Triplet Matrix, but how can I write it to a simple text file?


I think I might be misunderstanding the question, but if all you want to do is export the term document matrix to a file, then how about this:

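# as.matrix() returns the dense matrix directly; inspect() is meant for printing
m <- as.matrix(tdm)
DF <- as.data.frame(m, stringsAsFactors = FALSE)
# without a file argument, write.table prints to the console
write.table(DF)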

Is that what you're after mate?

Hope that helps a little,

Tony Breyal


Should the file be "human-readable"? If not, use dump, dput, or save. If so, convert your list into a data.frame.
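As a rough illustration of the non-human-readable route, the sketch below round-trips a small list through dput/dget and save/load. The list obj is a made-up stand-in for tdm, and the temporary file names are placeholders, not anything from the original question.

```r
# Toy list standing in for the tdm object; the contents are made up.
obj <- list(i = c(1L, 2L), j = c(1L, 1L), v = c(3.32, 2), nrow = 2L, ncol = 1L)

# dput() writes an R-readable text representation; dget() reads it back.
txt_file <- tempfile(fileext = ".R")
dput(obj, file = txt_file)
from_dput <- dget(txt_file)

# save() writes a binary file; load() restores the object under its
# original name, here into a separate environment to avoid overwriting obj.
bin_file <- tempfile(fileext = ".RData")
save(obj, file = bin_file)
restored <- new.env()
load(bin_file, envir = restored)
```

Both files can only be read back sensibly by R, which is the trade-off against a plain table.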

Edit: You can convert your list into a matrix if all list elements have the same length by doing matrix(unlist(list.name), nrow=length(list.name[[1]])) or something like that (or with plyr).

Why aren't you doing your SVM analysis in R (e.g. with kernlab)?

Edit 2: Ok, I looked at your data, and it isn't easy to convert into a matrix because the list elements aren't equal length:

> is.list(tdm)
[1] TRUE
> str(tdm)
List of 7
 $ i        : int [1:1475] 15 29 151 152 173 205 215 216 227 228 ...
 $ j        : int [1:1475] 1 1 1 1 1 1 1 1 1 1 ...
 $ v        : Named num [1:1475] 3.32 4.32 2.32 2 2.32 ...
  ..- attr(*, "names")= chr [1:1475] "1.50" "16.00" "barrel," "barrel." ...
 $ nrow     : int 985
 $ ncol     : int 20
 $ dimnames :List of 2
  ..$ Terms: chr [1:985] "(bpd)" "(bpd)." "(gcc)" "(it) appears to be nearing a crossroads with regard to\nderegulation, both as it pertains to investments and imports," ...
  ..$ Docs : chr [1:20] "127" "144" "191" "194" ...
 $ Weighting: chr [1:2] "term frequency - inverse document frequency" "tf-idf"
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"

In order to convert this to a matrix, you will need to either take elements of this list (e.g. i, j) or else do some other manipulation.
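One way to picture that manipulation, as a sketch only: i and j are the row and column positions of the non-zero entries and v holds their values, so a dense matrix can be filled in from them. The list stm below is a tiny made-up stand-in that mirrors the str(tdm) output above; it is not the real crude data, and the output file is a temporary placeholder.

```r
# Toy stand-in for the structure shown by str(tdm); the values are made up.
stm <- list(
  i = c(1L, 3L, 2L, 3L),          # row (term) index of each non-zero entry
  j = c(1L, 1L, 2L, 3L),          # column (document) index of each entry
  v = c(3.32, 2.00, 4.32, 2.32),  # the non-zero values themselves
  nrow = 3L, ncol = 3L,
  dimnames = list(Terms = c("barrel", "crude", "oil"),
                  Docs  = c("127", "144", "191"))
)

# Start from an all-zero matrix and fill in only the non-zero cells,
# indexing with a two-column (row, column) matrix.
dense <- matrix(0, nrow = stm$nrow, ncol = stm$ncol, dimnames = stm$dimnames)
dense[cbind(stm$i, stm$j)] <- stm$v

# A plain matrix is something write.table can handle.
out_file <- tempfile(fileext = ".txt")
write.table(dense, file = out_file, sep = "\t", quote = FALSE)
```

The dense matrix needs nrow * ncol cells, so this only makes sense when the matrix fits in memory.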

Edit 3: Just to conclude my commentary here: these objects are intended to be used with the inspect function (see the package vignette).

As discussed, in order to use a function like write.table, you will need to convert your list into a matrix, which requires some manipulation of that list so that you end up with several vectors of equal length. Looking at the structure of these tm objects, this will be very difficult to do, and I suggest you work with the helper functions that are included with that package.


dtmMatrix <- as.matrix(dtm)
write.csv(dtmMatrix, 'mydata.csv')

This certainly does the job. However, when I tried it on a very large DTM (25000 by 35000), it failed with out-of-memory errors.

I used the following method:

dtm <- DocumentTermMatrix(corpus)
dtm1 <- removeSparseTerms(dtm,0.998)   ##max allowed sparsity 0.998

m <- as.matrix(dtm1)   ##dense matrix; inspect() is meant for printing
DF <- as.data.frame(m, stringsAsFactors = FALSE)
write.csv(DF,"mydata0.998sparse.csv")

This reduced the size of the document-term matrix considerably. You can raise the maximum allowed sparsity (closer to 1) to keep more terms in DF.
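A different way around the memory problem, not the method used above, is to skip the dense matrix and write only the non-zero entries, one per row. The sketch below assumes the i/j/v/dimnames layout shown earlier in this thread; stm is a made-up toy object and the column names doc, term, and value are illustrative choices.

```r
# Toy stand-in for a document-term matrix in triplet form; values are made up.
stm <- list(
  i = c(1L, 1L, 2L, 3L),   # row (document) index of each non-zero entry
  j = c(1L, 3L, 2L, 3L),   # column (term) index of each non-zero entry
  v = c(1, 2, 1, 4),       # the non-zero values
  nrow = 3L, ncol = 3L,
  dimnames = list(Docs  = c("127", "144", "191"),
                  Terms = c("barrel", "crude", "oil"))
)

# One row per non-zero entry: the table grows with the number of
# non-zero cells rather than with nrow * ncol.
long <- data.frame(
  doc   = stm$dimnames$Docs[stm$i],
  term  = stm$dimnames$Terms[stm$j],
  value = stm$v,
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")
write.csv(long, out_file, row.names = FALSE)
```

Whether this long format is usable depends on what reads the file afterwards; a tool that expects a full documents-by-terms table would still need the dense form.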
