
Processing data in chunks

I have a dataset "X" of about 8 million observations and 5 character variables - call them A, B, C, D and E. I am trying to calculate the Jaro-Winkler similarity between D and E with the RecordLinkage package:

library(RecordLinkage)
X$jw = jarowinkler(X$D, X$E)

The problem is that more and more memory keeps getting used up until the computer simply freezes. Is there any way of automatically doing the processing in "chunks", without having to manually split X into reasonably small subsets beforehand and working with each subset individually?

In other words, is there any built-in function which does the splitting and processing without me having to do it upfront?


Well, the simplest solution would probably be to use the nrows argument to read.table (or read.csv, or whatever you are using to load the data). Set nrows to a small value, then loop through the segments, removing unwanted objects and calling gc() as you go.
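For illustration, here is a minimal sketch of that approach, assuming the data sit in a hypothetical file "X.csv" with a header row A,B,C,D,E (the file name, separator and chunk size are assumptions; adjust them to your data). It reads from an open connection, so each read.csv call with nrows picks up where the previous one stopped:

library(RecordLinkage)

chunk_size <- 500000                      # rows per chunk; tune to your RAM

# read the header once so every chunk gets the right column names
header <- names(read.csv("X.csv", nrows = 1))

con <- file("X.csv", open = "r")
invisible(readLines(con, n = 1))          # skip the header line

results <- list()
i <- 1
repeat {
  chunk <- tryCatch(
    read.csv(con, header = FALSE, nrows = chunk_size,
             col.names = header, colClasses = "character"),
    error = function(e) NULL)             # read.csv errors at end of file
  if (is.null(chunk) || nrow(chunk) == 0) break
  results[[i]] <- jarowinkler(chunk$D, chunk$E)
  i <- i + 1
  rm(chunk)
  gc()                                    # release memory before the next chunk
}
close(con)

jw <- unlist(results)                     # one score per row, in file order

Because the scores are computed chunk by chunk, only one chunk plus the growing numeric result vector has to fit in memory at any time.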

