Best practices for storing and using data frames too large for memory?

I'm working with a large data frame, and have run up against RAM limits. At this point, I probably need to work with a serialized version on the disk. There are a few packages to support out-of-memory operations, but I'm not sure which one will suit my needs. I'd prefer to keep everything in data frames, so the ff package looks encouraging, but there are still compatibility problems that I can't work around.

What's the first tool to reach for when you realize that your data has reached out-of-memory scale?


You probably want to look at these packages:

  • ff for 'flat-file' storage and very efficient retrieval (handles data.frames and mixed column types)
  • bigmemory for out-of-R-memory, but still in-RAM (or file-backed), use (matrices only, all of one data type)
  • biglm for out-of-memory model fitting with lm()- and glm()-style models.

and also see the High-Performance Computing task view.
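
A minimal sketch of how ff and biglm can work together, assuming a purely illustrative CSV file, model formula, and chunk size ("big.csv", y, x1, and x2 are placeholder names, not from the question): read the data into a file-backed ffdf, then fit the model chunk by chunk so only one chunk sits in RAM at a time.

library(ff)      # file-backed data frames (ffdf)
library(biglm)   # incremental fitting of lm()/glm()-style models

# Read a large CSV into an on-disk ffdf instead of an in-memory data.frame
dat <- read.csv.ffdf(file = "big.csv", header = TRUE)

chunk_size <- 100000
starts <- seq(1, nrow(dat), by = chunk_size)

# Fit on the first chunk, then update the model with each remaining chunk
first <- dat[starts[1]:min(chunk_size, nrow(dat)), ]
fit <- biglm(y ~ x1 + x2, data = first)
for (s in starts[-1]) {
  chunk <- dat[s:min(s + chunk_size - 1, nrow(dat)), ]
  fit <- update(fit, chunk)
}
summary(fit)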


I would say disk.frame is a good candidate for this type of task. I am the primary author of the package.

Unlike ff and bigmemory, which restrict which data types can be handled easily, disk.frame tries to "mimic" data.frames and provides dplyr verbs for manipulating the data.
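
To make that concrete, here is a rough sketch of the intended workflow, assuming a placeholder CSV file and column names (none of these come from the question) and noting that the exact set of supported dplyr verbs depends on the disk.frame version:

library(disk.frame)
library(dplyr)

# Start background workers so chunks can be processed in parallel
setup_disk.frame()

# Convert a large CSV into a disk.frame, stored as chunks on disk
df <- csv_to_disk.frame("big.csv", outdir = "big.df")

# dplyr verbs run chunk by chunk; collect() brings the (much smaller)
# result back into RAM as a regular data.frame
result <- df %>%
  filter(x1 > 0) %>%
  group_by(grp) %>%
  summarise(mean_y = mean(y)) %>%
  collect()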


If you are dealing with memory issues, try the following steps:

  1. Close other processes that consume RAM. In particular, avoid keeping a browser open with many tabs, as browsers can use a lot of memory.

  2. Once that is done, get a feel for the structure of your dataset file. Read a small sample first, e.g. read.csv() (or read.table()) with nrows = 100, so you can see which columns are present and what types they have. If a column is not useful, drop it.

  3. Once you know the column classes (colClasses), you can import the entire data frame in one go.

Here is the sample code:

# Read a small sample to learn the column classes
initial <- read.table("datatable.txt", nrows = 100)
classes <- sapply(initial, class)
# Supplying colClasses skips type detection and speeds up the full read
tabAll <- read.table("datatable.txt", colClasses = classes)
  4. Use data.table::fread() to read large data frames (see the sketch after this list).

  5. If that still does not solve the problem, split the dataset by rows into two equal parts, apply a dimensionality-reduction technique, and then merge the parts back together.
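
For step 4, a small hedged example using data.table::fread() (the file name follows the sample code above; the select argument and column names are illustrative, to show how unneeded columns can be skipped at read time):

library(data.table)

# fread() auto-detects the separator and column types and is typically much
# faster than read.table() on large files; select reads only the named columns
dt <- fread("datatable.txt", select = c("col1", "col2"))
head(dt)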

I hope it helps.
