The software I am using produces log files with a variable number of lines of summary information followed by lots of tab delimited data. I am trying to write a function that will read the data from these log files into a data frame ignoring the summary information. The summary information never contains a tab, so the following function works:
read.parameters <- function(file.name, ...){
  lines <- scan(file.name, what = "character", sep = "\n")
  first.line <- min(grep("\\t", lines))
  return(read.delim(file.name, skip = first.line - 1, ...))
}
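For example, with a placeholder file name (any extra arguments are passed straight through to read.delim):
df <- read.parameters("run1.log")
head(df)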
However, these log files are quite big, so reading the file twice is very slow. Surely there is a better way?
Edited to add: Marek suggested using a textConnection object. The way he suggested in his answer fails on a big file, but the following works:
read.parameters <- function(file.name, ...){
  conn <- file(file.name, "r")
  on.exit(close(conn))
  # read one line at a time until we hit the first line containing a tab,
  # then push it back so read.delim() still sees it as the header
  repeat{
    line <- readLines(conn, 1)
    if (length(grep("\\t", line))) {
      pushBack(line, conn)
      break
    }
  }
  df <- read.delim(conn, ...)
  return(df)
}
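In case pushBack is unfamiliar: lines pushed back onto a text-mode connection are returned by the next read, so the first data line (which is also the header) is not lost. A toy illustration with made-up contents:
conn <- textConnection(c("summary info", "more summary", "a\tb", "1\t2"))
readLines(conn, 1)             # "summary info"
readLines(conn, 1)             # "more summary"
line <- readLines(conn, 1)     # "a\tb" -- the first line containing a tab
pushBack(line, conn)           # put it back so read.delim() still sees the header
read.delim(conn)               # a data frame with columns a and b
close(conn)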
Edited again: Thanks, Marek, for the further improvement to the above function.
You don't need to read the file twice. Use textConnection on the result of the first scan:
read.parameters <- function(file.name, ...){
  # (you had "tmp.log" hard-coded here; I assume it should be file.name)
  lines <- scan(file.name, what = "character", sep = "\n")
  first.line <- min(grep("\\t", lines))
  return(read.delim(textConnection(lines), skip = first.line - 1, ...))
}
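(As noted in the edit to the question, scan() plus textConnection() keeps a full copy of the file's lines in memory, which is presumably why this version struggles on very large logs; the readLines/pushBack version above avoids that.)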
If you can be sure that the header info won't be more than N lines, e.g. N = 200, then try:
scan(..., nlines = N)
That way you won't re-read more than N lines.
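A minimal sketch of that idea (the function name and the default N = 200 are only placeholders): at most the first N lines are read twice, and read.delim reads the rest of the file once.
read.parameters.capped <- function(file.name, N = 200, ...){
  head.lines <- scan(file.name, what = "character", sep = "\n",
                     nlines = N, quiet = TRUE)
  first.line <- min(grep("\\t", head.lines))
  return(read.delim(file.name, skip = first.line - 1, ...))
}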