I have over 100 survey data files with the following filename structure in a common directory:
BD-1994.rdta
BD-1996.rdta
BD-1999.rdta
BD-2004.rdta
BF-1992.rdta
...
UG-1988.rdta
UG-1995.rdta
UG-2001.rdta
VN-1992.rdta
VN-1997.rdta
The leading two letters (e.g., "BD") represent a specific country (by its ISO code) and the four digits represent the year of a given survey.
I would like to process these data so I can create one multi-line, time-series graph of fertility rates per country where each line represents a year of the survey. For example, the first graph will be for "BD" (Bangladesh) and will display four time-series for years 1994, 1996, 1999, and 2004.
The structure of the individual files is as follows:
time fertility
1 3.2
2 2.6
... ...
7 2.4
My idea at the moment is to use rbind within a for loop and create one massive dataset with all the data in it. Then I need to split the data neatly by country code, perhaps using a function like "subset" (though it doesn't look like subset is the right tool for the job).
Any suggestions on how to perform this data management so I can then call the plot function in R on a dataframe that contains the survey data for all years within a given country?
Thank you
Here is one approach using ggplot2 and plyr. The basic idea is to create two helper functions to (a) extract the data from each .rdta file into a data frame and (b) plot the time series for each country. Once these functions are defined, it is relatively straightforward to use plyr functions to loop through the files and produce the required graphs. I suggest you run this code on your data and report back with any errors you get, since I cannot test it without the data.
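library(plyr)
# function to extract data frame from each .rdta file
# (assumes each file can be read by load() and holds a single data frame;
# for plain-text tables, read.table(file_name, header = TRUE) would replace
# the load()/get() steps)
get_data_frame = function(file_name){
  temp_env = new.env()
  load(file_name, envir = temp_env)
  mydata = get(ls(envir = temp_env)[1], envir = temp_env)
  # file names look like "BD-1994.rdta": country code, then survey year
  country = substr(basename(file_name), 1, 2)
  year = substr(basename(file_name), 4, 7)
  df = data.frame(mydata, country = country, year = year)
  return(df)
}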
# function to save time series plot of fertility, one line per survey year
plot_country_data = function(country_df){
  require(ggplot2)
  p1 = ggplot(country_df, aes(x = time, y = fertility)) +
    geom_line(aes(group = year, colour = factor(year)))
  # name the pdf after the country code, e.g. "BD.pdf"
  ggsave(filename = paste(country_df$country[1], ".pdf", sep = ""), plot = p1)
}
# list all .rdta files in working directory
rdata_files = list.files(pattern = "\\.rdta$")
# consolidate data into one big data frame
big_data = ldply(rdata_files, get_data_frame)
# plot data for each country and save as pdf
d_ply(big_data, .(country), plot_country_data)
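If you would rather stay in base R, the rbind-then-split idea from the question can be sketched as below. This is only a sketch under assumptions: the file names are a made-up subset following the BD-1994.rdta pattern, and fake_survey is a hypothetical stand-in that fabricates seven time points instead of reading a real file, so you would swap in whatever actually reads your data. It shows that split(), rather than subset(), gives one data frame per country in a single call.

```r
# Hypothetical file names following the pattern in the question
file_names <- c("BD-1994.rdta", "BD-1996.rdta", "UG-1988.rdta")

# Stand-in for reading one survey file (made-up numbers, not real data);
# country and year are parsed from the file name
fake_survey <- function(file_name) {
  base <- basename(file_name)  # drop any directory prefix
  data.frame(
    country   = substr(base, 1, 2),
    year      = as.integer(substr(base, 4, 7)),
    time      = 1:7,
    fertility = seq(3.2, 2.4, length.out = 7),
    stringsAsFactors = FALSE
  )
}

# Stack all surveys into one data frame without an explicit for loop
all_surveys <- do.call(rbind, lapply(file_names, fake_survey))

# One data frame per country code, returned as a named list
by_country <- split(all_surveys, all_surveys$country)
```

Each element of by_country (for example by_country$BD) holds every survey year for that country, so it can be passed straight to a plotting function, or you can lapply over the list to produce one graph per country.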