I am a newbie to R and I have a problem splitting a very large data frame into a nested list. I looked for help on the internet, but without success.
Here is a simplified example of how my data are organized.
The headers are:
1. "station" (number)
2. "date.str" (date string)
3. "member"
4. "forecast.time"
5. "data1"
I am not sure my data example will display correctly, but if it does, it looks like this:
station date.str member forecast.time data1
6019 20110805 mbr000 06 77
6031 20110805 mbr000 06 28
6071 20110805 mbr000 06 45
6019 20110805 mbr001 12 22
6019 20110806 mbr024 18 66
I want to split the large data frame into a nested list by "station", "member", "date.str" and "forecast.time", so that mylist[[c(s,m,d,t)]] contains a data frame with the data for station "s", member "m", date.str "d" and forecast time "t", preserving the values of s, m, d and t.
My code is:
data.st <- list()
data.st.member <- list()
data.st.member.dato <- list()
data.st. <- split(mydata, mydata$station)
data.st.member <- lapply(data.st, FUN = fsplit.member)
(I created a function to split after "member")
#Loop over station number:
for (s in 1:S){
#Loop over members:
for (m in 1:length(members){
tmp <- split( data.st.member[[s]][[m]], data.st.member[[s]][[m]]$dato.str )
#Loop over number of different "date.str"s
for (t in 1:length(no.date.str) ){
data.st.member.dato[[s]][[m]][[t]] <- tmp}
} #end m loop
} #end s loop
I would also like to split by the forecast time, forecast.time, but I didn't get that far.
I have tried a couple of different configurations within the loops, so at the moment I don't have a consistent error message. I can't figure out what I am doing or thinking wrong.
Any help is much appreciated!
Regards Sisse
It's easier than you think. You can pass a list into split in order to split on several factors.
Reproducible example
with(airquality, split(airquality, list(Month, Day)))
With your data
data.st <- with(mydata,
# Pass the columns themselves (unquoted), not their names as strings
split(mydata, list(station, member, date.str, forecast.time))
)
Note: This doesn't give you a nested list like you asked for, but as Joran commented, you very probably don't want that. A flat list will be nicer to work with.
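To make the flat result concrete, here is a minimal sketch that rebuilds the five sample rows from the question and splits them. The column types are an assumption (the real data may be read in differently), and `drop = TRUE` is added here so that combinations that never occur are left out of the result.

```r
# Sample rows from the question; storing date.str and forecast.time
# as character is an assumption made for this sketch.
mydata <- data.frame(
  station       = c(6019, 6031, 6071, 6019, 6019),
  date.str      = c("20110805", "20110805", "20110805", "20110805", "20110806"),
  member        = c("mbr000", "mbr000", "mbr000", "mbr001", "mbr024"),
  forecast.time = c("06", "06", "06", "12", "18"),
  data1         = c(77, 28, 45, 22, 66),
  stringsAsFactors = FALSE
)

# drop = TRUE discards station/member/date/time combinations with no rows
flat <- with(mydata,
  split(mydata, list(station, member, date.str, forecast.time), drop = TRUE)
)

# Each element is named by its key values joined with "."
names(flat)
flat[["6019.mbr000.20110805.06"]]
```

Each element of `flat` is still a data frame with all five columns, so the values of station, member, date.str and forecast.time are preserved inside every piece.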
Speculating wildly: did you just want to calculate statistics on different chunks of data? If so, then see the many questions here on split-apply-combine problems.
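If statistics per chunk are indeed the goal, a base R sketch of split-apply-combine could look like the following. The choice of the mean of data1 per station is purely illustrative, and the sample data frame is rebuilt from the question with assumed column types.

```r
# Sample rows from the question (column types are an assumption)
mydata <- data.frame(
  station       = c(6019, 6031, 6071, 6019, 6019),
  date.str      = c("20110805", "20110805", "20110805", "20110805", "20110806"),
  member        = c("mbr000", "mbr000", "mbr000", "mbr001", "mbr024"),
  forecast.time = c("06", "06", "06", "12", "18"),
  data1         = c(77, 28, 45, 22, 66),
  stringsAsFactors = FALSE
)

# Split by station, apply mean() to data1, combine into one data frame
station_means <- aggregate(data1 ~ station, data = mydata, FUN = mean)
station_means
```

More grouping variables can be added on the right-hand side of the formula, for example `data1 ~ station + member`.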
I also want to echo the others: this recursive data structure is going to be difficult to work with, and there are probably better ways. Do look at the split-apply-combine approach as Richie suggested. However, the constraints may be external, so here is an answer using the plyr library.
library(plyr)
mylist <- dlply(mydata, .(station), dlply, .(member), dlply, .(date.str), dlply, .(forecast.time), identity)
Using the snippet of data you gave for mydata:
> mylist[[c("6019","mbr000","20110805","6")]]
station date.str member forecast.time data1
1 6019 20110805 mbr000 6 77
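For comparison, the same kind of nested list can be sketched in base R with a small recursive helper. The function name `nested_split` is hypothetical (it is not part of base R or plyr), and the sample data frame assumes forecast.time is stored as character, so the key here is "06" rather than "6".

```r
# Sample rows from the question (column types are an assumption)
mydata <- data.frame(
  station       = c(6019, 6031, 6071, 6019, 6019),
  date.str      = c("20110805", "20110805", "20110805", "20110805", "20110806"),
  member        = c("mbr000", "mbr000", "mbr000", "mbr001", "mbr024"),
  forecast.time = c("06", "06", "06", "12", "18"),
  data1         = c(77, 28, 45, 22, 66),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on the first key, then recurse on the rest
nested_split <- function(df, keys) {
  if (length(keys) == 0) return(df)          # no keys left: return the piece
  pieces <- split(df, df[[keys[1]]], drop = TRUE)
  lapply(pieces, nested_split, keys = keys[-1])
}

nested <- nested_split(mydata,
                       c("station", "member", "date.str", "forecast.time"))

# Recursive indexing with a character vector walks down the levels
nested[[c("6019", "mbr000", "20110805", "06")]]
```

Because each level is split only on the values present in its own piece, the nested list contains no empty branches.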