I have a dataset "X" of about 8 million observations and 5 character variables - call them A, B, C, D and E. I am trying to calculate the Jaro-Winkler statistic between D and E with the RecordLinkage package:
library(RecordLinkage)
X$jw = jarowinkler(X$D, X$E)
The problem is that more and more memory keeps getting used up until the computer simply freezes. Is there any way to automatically do the processing in "chunks", without having to manually split X into reasonably small pieces beforehand and work with the individual subsets?
In other words, is there any built-in function which does the splitting and processing without me having to do it upfront?
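For reference, the manual chunking I am hoping to avoid would look roughly like this (just a sketch; the chunk size is arbitrary):

library(RecordLinkage)

chunk_size <- 100000
X$jw <- NA_real_
starts <- seq(1, nrow(X), by = chunk_size)

for (s in starts) {
  e <- min(s + chunk_size - 1, nrow(X))
  # compute the Jaro-Winkler score for this slice only
  X$jw[s:e] <- jarowinkler(X$D[s:e], X$E[s:e])
  gc()  # try to release the temporary memory used by each chunk
}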
Well, the simplest solution would probably be to use the nrows argument to read.table (or read.csv, or whatever you use to load the data). Set nrows to a small value, then loop through the segments, removing unwanted objects and calling gc() as you go.
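A rough sketch of that idea, assuming the data sit in a comma-separated file "X.csv" with a header row and columns A through E (the file names and chunk size are placeholders, not anything from your question):

library(RecordLinkage)

chunk_size <- 100000
hdr <- names(read.csv("X.csv", nrows = 1))  # grab the column names once

i <- 0
repeat {
  # re-read the file, skipping the rows already processed
  chunk <- tryCatch(
    read.csv("X.csv", header = FALSE, col.names = hdr,
             skip = 1 + i * chunk_size, nrows = chunk_size,
             stringsAsFactors = FALSE),
    error = function(e) NULL)               # read.csv errors once we run past EOF
  if (is.null(chunk) || nrow(chunk) == 0) break

  chunk$jw <- jarowinkler(chunk$D, chunk$E)
  write.table(chunk, "X_jw.csv", sep = ",", row.names = FALSE,
              col.names = (i == 0), append = (i > 0))

  rm(chunk)                                 # drop the finished chunk ...
  gc()                                      # ... and return its memory
  i <- i + 1
}

Using skip plus nrows re-scans the skipped lines on every pass, so it is slow but simple; the point is only that each iteration keeps one small data frame in memory at a time.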