Is there anything I can do to get partial results after bumping into errors in a big file? I am using the following command to import data from files. It is the fastest way I know, but it is not robust: a small error can easily ruin the whole import. I hope there is at least a way for scan() (or any reader) to quickly report which row/line has the error, or to return the partial results it did read (then I would have an idea of where the error is). Then I could skip enough lines to recover over 99% of the good data.
rawData = scan(file = "rawData.csv", what = scanformat, sep = ",", skip = 1, quiet = TRUE, fill = TRUE, na.strings = c("-", "NA", "Na","N"))
All the data-import tutorials I have found seem to assume the files are in good shape; I did not find any useful hints for dealing with dirty files.
I will sincerely appreciate any hint or suggestion! It has been really frustrating.
Idea 1: Open a file connection (with the file() function) and then scan() line by line (with nlines = 1). Put each scan() inside try() to recover after reading a bad line.
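A minimal sketch of Idea 1, with one safety tweak: each line is pulled with readLines() first and parsed with scan(text = ...), so a failed parse cannot leave the connection stranded mid-line. The scanformat template and the demo file here are placeholders; substitute your real format and "rawData.csv".

```r
# Placeholder for the question's scanformat: two fields per record.
scanformat <- list(id = character(), value = numeric())

# Demo file standing in for rawData.csv; line 3 is deliberately malformed.
demo <- tempfile(fileext = ".csv")
writeLines(c("id,value", "a,1", "b,oops,extra", "c,3"), demo)

con <- file(demo, open = "r")
readLines(con, n = 1)                          # skip the header (skip = 1)
rows <- list(); badLines <- integer(0); lineNo <- 1
repeat {
  line <- readLines(con, n = 1)
  if (length(line) == 0) break                 # end of file
  lineNo <- lineNo + 1
  row <- try(scan(text = line, what = scanformat, sep = ",",
                  quiet = TRUE, fill = TRUE,
                  na.strings = c("-", "NA", "Na", "N")),
             silent = TRUE)
  if (inherits(row, "try-error")) {
    badLines <- c(badLines, lineNo)            # remember where it broke
  } else {
    rows[[length(rows) + 1L]] <- row
  }
}
close(con)
badLines                                       # file lines that failed to parse
```

This is slower than a single scan() over the whole file, but it keeps every parsable line and gives you the exact line numbers of the bad ones.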
Idea 2: Use readLines() to read the file in raw form, then strsplit() to parse it. You can analyse that output to find bad lines and remove them.
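A sketch of Idea 2, assuming "bad" means a line whose comma count disagrees with the header's. The demo file is a placeholder for the real data.

```r
# Demo file; line 3 has an extra field.
demo <- tempfile(fileext = ".csv")
writeLines(c("id,value", "a,1", "b,2,junk", "c,3"), demo)

lines   <- readLines(demo)                     # raw lines, no parsing yet
fields  <- strsplit(lines, ",", fixed = TRUE)  # split each line on commas
n       <- lengths(fields)                     # field count per line
badRows <- which(n != n[1])                    # header's count is the reference
goodLines <- lines[-c(1, badRows)]             # drop header and bad lines
badRows
```

Once the bad lines are gone, the survivors can be fed back through the fast path, e.g. scan(text = goodLines, ...) with the original arguments. Note that a naive strsplit() ignores quoting, so quoted fields containing commas would be miscounted.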
The count.fields() function will preprocess a table-like file and report how many fields it found on each line (in the sense that read.table() would look for fields). This is often a quick way to identify problem lines, because they show a different number of fields from what is expected (or simply different from the majority of the other lines).