Yesterday, this piece of code caused me a headache. I fixed it by reading the file line by line. Any ideas?
The while loop never seems to get executed, even though the number of lines in the file is greater than 1.
import java.io.File
import scala.io.Source

val lines = Source.fromFile(new File("file.txt")).getLines
println("total lines:" + lines.size)
var starti = 1
while (starti < lines.size) {
  val nexti = Math.min(starti + 10, lines.size)
  println("batch (" + starti + ", " + nexti + ") total:" + lines.size)
  val linesSub = lines.slice(starti, nexti)
  // do something with linesSub
  starti = nexti
}
This is indeed tricky, and I would even say it's a bug in Iterator. getLines returns an Iterator, which proceeds lazily. So what seems to happen is that when you ask for lines.size, the iterator goes through the whole file to count the lines. Afterwards, it is "exhausted":
scala> val lines = io.Source.fromFile(new java.io.File("....txt")).getLines
lines: Iterator[String] = non-empty iterator
scala> lines.size
res4: Int = 15
scala> lines.size
res5: Int = 0
scala> lines.hasNext
res6: Boolean = false
You see, when you execute size twice, the second result is zero.
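The same behaviour can be reproduced without a file. This is only a minimal sketch, assuming a List stands in for the file's lines; ExhaustionDemo, sizeTwice and sizeTwiceStrict are names made up for illustration:

```scala
object ExhaustionDemo {
  // Calls size twice on a fresh iterator over the given items.
  // The first call walks the iterator to its end, so the second finds nothing left.
  def sizeTwice(items: List[String]): (Int, Int) = {
    val it = items.iterator
    val first = it.size
    val second = it.size
    (first, second)
  }

  // Copying the iterator into a List first gives a collection whose size is stable.
  def sizeTwiceStrict(items: List[String]): (Int, Int) = {
    val strict = items.iterator.toList
    (strict.size, strict.size)
  }
}
```

Calling ExhaustionDemo.sizeTwice(List("a", "b", "c")) should show the second count dropping to zero, while sizeTwiceStrict reports the same count both times.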
There are two solutions: either you force the iterator into something 'stable', like lines.toSeq, or you forget about size and do the "normal" iteration:
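while (lines.hasNext) {
  // Materialise the batch first: calling size on the lazy result of take
  // would exhaust it before anything else could be done with it.
  val linesSub = lines.take(10).toList
  println("batch:" + linesSub.size)
  // do something with linesSub
}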
None of the above answers quite hits the nail on the head.
There's a good reason why an Iterator is returned here. By being lazy, it takes pressure off the heap, and the String representing each line can be garbage collected as soon as you've finished with it. In the case of large files, this can make all the difference for avoiding an OutOfMemoryError.
Ideally, you'd work directly with the iterator and not force it into a strict collection type.
Using grouped then, as per om-nom-nom's answer:
for (linesSub <- lines grouped 10) {
//do something with linesSub
}
And if you wanted to retain the println counter, zip in an index:
for ( (linesSub, batchIdx) <- (lines grouped 10).zipWithIndex ) {
println("batch " + batchIdx)
//do something with linesSub
}
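To see what grouped and zipWithIndex produce, here is a small sketch that runs on an in-memory iterator rather than a file. GroupedDemo and batchSizes are hypothetical names, and the final toList is there only to make the result easy to inspect:

```scala
object GroupedDemo {
  // Splits a one-shot iterator into batches of at most batchSize elements and
  // pairs each batch's zero-based index with the number of elements it holds.
  def batchSizes(lines: Iterator[String], batchSize: Int): List[(Int, Int)] =
    lines
      .grouped(batchSize)
      .zipWithIndex
      .map { case (batch, idx) => (idx, batch.size) }
      .toList
}
```

For 25 input lines and a batch size of 10, this should yield two full batches followed by a final batch of 5, without ever asking the iterator for its total size.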
If you really need the total, invoke getLines twice, on a freshly opened Source each time: once for the count, and a second time to actually process the lines.
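A two-pass version might look roughly like the sketch below. It assumes the file is reopened for each pass, since a Source is itself consumed as it is read; TwoPassDemo, withLines and countThenBatch are illustrative names, not part of scala.io:

```scala
import scala.io.Source

object TwoPassDemo {
  // Opens the file, hands a fresh line iterator to f, and always closes the Source.
  def withLines[A](path: String)(f: Iterator[String] => A): A = {
    val source = Source.fromFile(path)
    try f(source.getLines()) finally source.close()
  }

  // First pass counts the lines; second pass walks them again in batches
  // and returns how many batches were seen.
  def countThenBatch(path: String, batchSize: Int): (Int, Int) = {
    val total = withLines(path)(_.size)
    val batches = withLines(path)(_.grouped(batchSize).size)
    (total, batches)
  }
}
```

The trade-off is that the file is read twice, in exchange for never holding more than one batch of lines in memory.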
The second time you call lines.size, it returns 0. This is because lines is an iterator, not an array.
I've rewritten your code in a Seq way, as proposed in @0__'s answer:
import scala.io.Source

val batchSize = 10
// toSeq turns the one-shot iterator into a sequence that can be traversed repeatedly
val lines = Source.fromFile("file.txt").getLines.toSeq
println("total lines:" + lines.length)
var processed = 0
lines.grouped(batchSize).foreach { batch =>
  // batch.size is smaller than batchSize for the last, partial batch
  println("batch (" + processed + "," + (processed + batch.size) + ") total:" + lines.length)
  processed += batch.size
  // do something with batch
}