I'm processing huge data files (millions of lines each).
Before I start processing I'd like to get a count of the number of lines in the file, so I can then indicate how far along the processing is.
Because of the size of the files, it would not be practical to read the entire file into memory, just to count how many lines there are. Does anyone have a good suggestion on how to do this?
Reading the file a line at a time:
count = File.foreach(filename).inject(0) {|c, line| c+1}
or the Perl-ish approach, relying on the special variable $., which holds the number of the last line read:
File.foreach(filename) {}
count = $.
or
count = 0
File.open(filename) {|f| count = f.read.count("\n")}
will be slower than
count = %x{wc -l #{filename}}.split.first.to_i
If you are in a Unix environment, you can just let wc -l do the work.
It will not load the whole file into memory; since wc is optimized for streaming through a file and counting words/lines, its performance is good enough that there is no need to stream the file yourself in Ruby.
SSCCE:
filename = 'a_file/somewhere.txt'
line_count = `wc -l "#{filename}"`.strip.split(' ')[0].to_i
p line_count
Or if you want a collection of files passed on the command line:
wc_output = `wc -l "#{ARGV.join('" "')}"`
line_count = wc_output.match(/^ *([0-9]+) +total$/).captures[0].to_i
p line_count
It doesn't matter what language you're using, you're going to have to read the whole file if the lines are of variable length. That's because the newlines could be anywhere and there's no way to know without reading the file (assuming it isn't cached, which generally speaking it isn't).
If you want to indicate progress, you have two realistic options. You can extrapolate progress based on assumed line length:
assumed lines in file = size of file / assumed line size
progress = lines processed / assumed lines in file * 100%
since you know the size of the file. Alternatively you can measure progress as:
progress = bytes processed / size of file * 100%
This should be sufficient.
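For illustration, here is a minimal Ruby sketch of the byte-based option, assuming filename holds the path to the file being processed:
total_bytes = File.size(filename)
bytes_processed = 0
File.foreach(filename) do |line|
  bytes_processed += line.bytesize  # each yielded line includes its trailing newline
  progress = bytes_processed * 100.0 / total_bytes
  # ... process the line and report progress ...
end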
Using Ruby:
file = File.open("path-to-file", "r")
file.readlines.size
This was 39 milliseconds faster than wc -l on a file with 325,477 lines.
Summary of the posted solutions
require 'benchmark'
require 'csv'
filename = "name.csv"
Benchmark.bm do |x|
  x.report { `wc -l < #{filename}`.to_i }
  x.report { File.open(filename).inject(0) { |c, line| c + 1 } }
  x.report { File.foreach(filename).inject(0) { |c, line| c + 1 } }
  x.report { File.read(filename).scan(/\n/).count }
  x.report { CSV.open(filename, "r").readlines.count }
end
File with 807,802 lines:
       user     system      total        real
   0.000000   0.000000   0.010000 (  0.030606)
   0.370000   0.050000   0.420000 (  0.412472)
   0.360000   0.010000   0.370000 (  0.374642)
   0.290000   0.020000   0.310000 (  0.315488)
   3.190000   0.060000   3.250000 (  3.245171)
DISCLAIMER: the already existing benchmark used count rather than length or size, and was tedious to read IMHO. Hence this new answer.
Benchmark
require "benchmark"
require "benchmark/ips"
require "csv"
filename = ENV.fetch("FILENAME")
Benchmark.ips do |x|
  x.report("wc") { `wc -l #{filename}`.to_i }
  x.report("open") { File.open(filename).inject(0) { |c, _line| c + 1 } }
  x.report("foreach") { File.foreach(filename).inject(0) { |c, _line| c + 1 } }
  x.report("foreach $.") { File.foreach(filename) {}; $. }
  x.report("read.scan.length") { File.read(filename).scan(/\n/).length }
  x.report("CSV.open.readlines") { CSV.open(filename, "r").readlines.length }
  x.compare!
end
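Since the script reads the file name from the FILENAME environment variable, it can be invoked along these lines (the script name here is just a placeholder):
FILENAME=huge_data_file.csv ruby count_benchmark.rb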
On my MacBook Pro (2017) with a 2.3 GHz Intel Core i5 processor:
Warming up --------------------------------------
                  wc     8.000  i/100ms
                open     2.000  i/100ms
             foreach     2.000  i/100ms
          foreach $.     2.000  i/100ms
    read.scan.length     2.000  i/100ms
  CSV.open.readlines     1.000  i/100ms
 IO.readlines.length     2.000  i/100ms
Calculating -------------------------------------
                  wc    115.014 (±21.7%) i/s -    552.000 in   5.020531s
                open     22.450 (±26.7%) i/s -    104.000 in   5.049692s
             foreach     32.669 (±27.5%) i/s -    150.000 in   5.046793s
          foreach $.     25.244 (±31.7%) i/s -    112.000 in   5.020499s
    read.scan.length     44.102 (±31.7%) i/s -    190.000 in   5.033218s
  CSV.open.readlines      2.395 (±41.8%) i/s -     12.000 in   5.262561s
 IO.readlines.length     36.567 (±27.3%) i/s -    162.000 in   5.089395s

Comparison:
                  wc:      115.0 i/s
    read.scan.length:       44.1 i/s - 2.61x slower
 IO.readlines.length:       36.6 i/s - 3.15x slower
             foreach:       32.7 i/s - 3.52x slower
          foreach $.:       25.2 i/s - 4.56x slower
                open:       22.4 i/s - 5.12x slower
  CSV.open.readlines:        2.4 i/s - 48.02x slower
This was run against a file containing 75,516 lines and 3,532,510 characters (~47 chars per line). You should try it with your own file dimensions and machine, since results will vary.
Same as DJ's answer, but giving the actual Ruby code:
count = %x{wc -l file_path}.split[0].to_i
The first part, wc -l file_path, gives you output of the form num_lines file_path; the split and to_i then turn that into a number.
For reasons I don't fully understand, scanning the file for newlines using File.read seems to be a lot faster than doing CSV#readlines.count.
The following benchmark used a CSV file with 1,045,574 lines of data and 4 columns:
       user     system      total        real
   0.639000   0.047000   0.686000 (  0.682000)
  17.067000   0.171000  17.238000 ( 17.221173)
The code for the benchmark is below:
require 'benchmark'
require 'csv'
file = "1-25-2013 DATA.csv"
Benchmark.bm do |x|
  x.report { File.read(file).scan(/\n/).count }
  x.report { CSV.open(file, "r").readlines.count }
end
As you can see, scanning the file for newlines is an order of magnitude faster.
I have this one-liner:
puts File.foreach('myfile.txt').count
The test results for a file with more than 135k lines are shown below. This is my benchmark code:
require 'benchmark'

file_name = '100m.csv'
Benchmark.bm do |x|
  x.report { File.new(file_name).readlines.size }
  x.report { `wc -l "#{file_name}"`.strip.split(' ')[0].to_i }
  x.report { File.read(file_name).scan(/\n/).count }
end
The result is:
       user     system      total        real
   0.100000   0.040000   0.140000 (  0.143636)
   0.000000   0.000000   0.090000 (  0.093293)
   0.380000   0.060000   0.440000 (  0.464925)
The wc -l approach has one problem: wc -l counts newline characters, so if the file contains only one line and that line does not end with \n, the count is zero. So I recommend calling wc only when you are counting more than one line.
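To see the caveat in action (the file name here is made up):
File.write("no_newline.txt", "a single line without a trailing newline")
puts `wc -l no_newline.txt`.split.first.to_i  # => 0, because wc counts \n characters
puts File.foreach("no_newline.txt").count     # => 1, the unterminated last line is still yielded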
If the file is a CSV file and its content is numeric, the record lengths should be pretty uniform. Wouldn't it make sense to just divide the size of the file by the length of a record, or by the mean length of the first 100 records?
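A rough sketch of that estimate (filename is a placeholder, and Array#sum needs Ruby 2.4+):
sample = File.foreach(filename).first(100)           # read only the first 100 lines
avg_line_size = sample.sum(&:bytesize) / sample.size.to_f
estimated_lines = (File.size(filename) / avg_line_size).round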
With UNIX-style text files, it's very simple:
f = File.new("/path/to/whatever")
num_newlines = 0
while (c = f.getc) != nil
  num_newlines += 1 if c == "\n"
end
That's it. For MS Windows text files, you'll have to check for a sequence of "\r\n" instead of just "\n", but that's not much more difficult. For Mac OS Classic text files (as opposed to Mac OS X), you would check for "\r" instead of "\n".
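For what it's worth, a sketch of the Windows variant, tracking the previous character so each "\r\n" pair is counted once (the path is a placeholder):
f = File.new("/path/to/whatever")
num_newlines = 0
prev = nil
while (c = f.getc) != nil
  num_newlines += 1 if prev == "\r" && c == "\n"  # count complete "\r\n" sequences
  prev = c
end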
So, yeah, this looks like C. So what? C's awesome and Ruby is awesome because when a C answer is easiest that's what you can expect your Ruby code to look like. Hopefully your dain hasn't already been bramaged by Java.
By the way, please don't even consider any of the answers above that use the IO#read or IO#readlines methods and then call a String method on what's been read. You said you didn't want to read the whole file into memory, and that's exactly what these do. This is why Donald Knuth recommends that people understand how to program closer to the hardware: if they don't, they'll end up writing "weird code". Obviously you don't want to code close to the hardware whenever you don't have to, but that should be common sense. However, you should learn to recognize the instances in which you do have to get closer to the nuts and bolts, such as this one.
And don't try to get more "object oriented" than the situation calls for. That's an embarrassing trap for newbies who want to look more sophisticated than they really are. You should always be glad when the answer really is simple, and not be disappointed when there's no complexity to give you the opportunity to write "impressive" code. However, if you want to look somewhat "object oriented" and don't mind reading an entire line into memory at a time (i.e., you know the lines are short enough), you can do this:
f = File.new("/path/to/whatever")
num_newlines = 0
f.each_line do
  num_newlines += 1
end
This would be a good compromise, but only if the lines aren't too long; in that case it might even run more quickly than my first solution.
wc -l in Ruby with less memory, the lazy way:
(ARGV.length == 0 ?
  [["", STDIN]] :
  ARGV.lazy.map { |file_name|
    [file_name, File.open(file_name)]
  })
  .map { |file_name, file|
    "%8d %s\n" % [*file
      .each_line
      .lazy
      .map { |line| 1 }
      .reduce(:+), file_name]
  }
  .each(&:display)
as originally shown by Shugo Maeda.
Example:
$ curl -s -o wc.rb -L https://git.io/vVrQi
$ chmod u+x wc.rb
$ ./wc.rb huge_data_file.csv
43217291 huge_data_file.csv
Using foreach without inject is about 3% faster than with inject. Both are very much faster (more than 100x in my experience) than using getc.
Using foreach without inject can also be slightly simplified (relative to the snippet given elsewhere in this thread) as follows:
count = 0; File.foreach(path) { count += 1 }
puts "count: #{count}"
You can read to the end of the file and then check the line number of the last line read (note that readlines still reads the whole file):
f = File.new('huge-file')
f.readlines[-1]
count = f.lineno
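Note that readlines builds an array of every line in memory; if all you want is the final line number, a sketch that streams instead (it still reads the whole file, but keeps only one line at a time):
f = File.new('huge-file')
f.each_line { }  # read to EOF, discarding each line; lineno is updated as we go
count = f.lineno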