开发者

Creating an R dataframe row-by-row

开发者 https://www.devze.com 2023-01-14 16:27 出处:网络
I would like to construct a dataframe row-by-row in R. I\'ve done some searching, and all I came up with is the suggest开发者_StackOverflow社区ion to create an empty list, keep a list index scalar, th

I would like to construct a dataframe row-by-row in R. I've done some searching, and all I came up with is the suggest开发者_StackOverflow社区ion to create an empty list, keep a list index scalar, then each time add to the list a single-row dataframe and advance the list index by one. Finally, do.call(rbind,) on the list.

While this works, it seems very cumbersome. Isn't there an easier way for achieving the same goal?

Obviously I refer to cases where I can't use some apply function and explicitly need to create the dataframe row by row. At least, is there a way to push into the end of a list instead of explicitly keeping track of the last index used?


You can grow them row by row by appending or using rbind().

That does not mean you should. Dynamically growing structures is one of the least efficient ways to code in R.

If you can, allocate your entire data.frame up front:

N <- 1e4  # total number of rows to preallocate--possibly an overestimate

DF <- data.frame(num=rep(NA, N), txt=rep("", N),  # as many cols as you need
                 stringsAsFactors=FALSE)          # you don't know levels yet

and then during your operations insert row at a time

DF[i, ] <- list(1.4, "foo")

That should work for arbitrary data.frame and be much more efficient. If you overshot N you can always shrink empty rows out at the end.


One can add rows to NULL:

df<-NULL;
while(...){
  #Some code that generates new row
  rbind(df,row)->df
}

for instance

df<-NULL
for(e in 1:10) rbind(df,data.frame(x=e,square=e^2,even=factor(e%%2==0)))->df
print(df)


This is a silly example of how to use do.call(rbind,) on the output of Map() [which is similar to lapply()]

> DF <- do.call(rbind,Map(function(x) data.frame(a=x,b=x+1),x=1:3))
> DF
  x y
1 1 2
2 2 3
3 3 4
> class(DF)
[1] "data.frame"

I use this construct quite often.


The reason I like Rcpp so much is that I don't always get how R Core thinks, and with Rcpp, more often than not, I don't have to.

Speaking philosophically, you're in a state of sin with regards to the functional paradigm, which tries to ensure that every value appears independent of every other value; changing one value should never cause a visible change in another value, the way you get with pointers sharing representation in C.

The problems arise when functional programming signals the small craft to move out of the way, and the small craft replies "I'm a lighthouse". Making a long series of small changes to a large object which you want to process on in the meantime puts you square into lighthouse territory.

In the C++ STL, push_back() is a way of life. It doesn't try to be functional, but it does try to accommodate common programming idioms efficiently.

With some cleverness behind the scenes, you can sometimes arrange to have one foot in each world. Snapshot based file systems are a good example (which evolved from concepts such as union mounts, which also ply both sides).

If R Core wanted to do this, underlying vector storage could function like a union mount. One reference to the vector storage might be valid for subscripts 1:N, while another reference to the same storage is valid for subscripts 1:(N+1). There could be reserved storage not yet validly referenced by anything but convenient for a quick push_back(). You don't violate the functional concept when appending outside the range that any existing reference considers valid.

Eventually appending rows incrementally, you run out of reserved storage. You'll need to create new copies of everything, with the storage multiplied by some increment. The STL implementations I've use tend to multiply storage by 2 when extending allocation. I thought I read in R Internals that there is a memory structure where the storage increments by 20%. Either way, growth operations occur with logarithmic frequency relative to the total number of elements appended. On an amortized basis, this is usually acceptable.

As tricks behind the scenes go, I've seen worse. Every time you push_back() a new row onto the dataframe, a top level index structure would need to be copied. The new row could append onto shared representation without impacting any old functional values. I don't even think it would complicate the garbage collector much; since I'm not proposing push_front() all references are prefix references to the front of the allocated vector storage.


I've found this way to create dataframe by raw without matrix.

With automatic column name

df<-data.frame(
        t(data.frame(c(1,"a",100),c(2,"b",200),c(3,"c",300)))
        ,row.names = NULL,stringsAsFactors = FALSE
    )

With column name

df<-setNames(
        data.frame(
            t(data.frame(c(1,"a",100),c(2,"b",200),c(3,"c",300)))
            ,row.names = NULL,stringsAsFactors = FALSE
        ), 
        c("col1","col2","col3")
    )


Dirk Eddelbuettel's answer is the best; here I just note that you can get away with not pre-specifying the dataframe dimensions or data types, which is sometimes useful if you have multiple data types and lots of columns:

row1<-list("a",1,FALSE) #use 'list', not 'c' or 'cbind'!
row2<-list("b",2,TRUE)  

df<-data.frame(row1,stringsAsFactors = F) #first row
df<-rbind(df,row2) #now this works as you'd expect.


If you have vectors destined to become rows, concatenate them using c(), pass them to a matrix row-by-row, and convert that matrix to a dataframe.

For example, rows

dummydata1=c(2002,10,1,12.00,101,426340.0,4411238.0,3598.0,0.92,57.77,4.80,238.29,-9.9)
dummydata2=c(2002,10,2,12.00,101,426340.0,4411238.0,3598.0,-3.02,78.77,-9999.00,-99.0,-9.9)
dummydata3=c(2002,10,8,12.00,101,426340.0,4411238.0,3598.0,-5.02,88.77,-9999.00,-99.0,-9.9)

can be converted to a data frame thus:

dummyset=c(dummydata1,dummydata2,dummydata3)
col.len=length(dummydata1)
dummytable=data.frame(matrix(data=dummyset,ncol=col.len,byrow=TRUE))

Admittedly, I see 2 major limitations: (1) this only works with single-mode data, and (2) you must know your final # columns for this to work (i.e., I'm assuming that you're not working with a ragged array whose greatest row length is unknown a priori).

This solution seems simple, but from my experience with type conversions in R, I'm sure it creates new challenges down-the-line. Can anyone comment on this?


Depending on the format of your new row, you might use tibble::add_row if your new row is simple and can specified in "value-pairs". Or you could use dplyr::bind_rows, "an efficient implementation of the common pattern of do.call(rbind, dfs)".

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号