Scala quirk in this while loop code

Developer https://www.devze.com 2023-03-22 01:05 Source: web
Yesterday, this piece of code caused me a headache. I fixed it by reading the file line by line. Any ideas?

The while loop never seems to get executed even though the number of lines in the file is greater than 1.

 val lines = Source.fromFile( new File("file.txt") ).getLines;

 println( "total lines:"+lines.size );

 var starti = 1;
 while( starti < lines.size ){
   val nexti = Math.min( starti + 10, lines.size  );

   println( "batch ("+starti+", "+nexti+") total:" + lines.size )
   val linesSub = lines.slice(starti, nexti)
   //do something with linesSub
   starti = nexti
 }


This is indeed tricky, and I would even say it's a bug in Iterator. getLines returns an Iterator, which is lazy and can be traversed only once. So when you ask for lines.size, the iterator goes through the whole file to count the lines. Afterwards, it's "exhausted":

scala> val lines = io.Source.fromFile(new java.io.File("....txt")).getLines
lines: Iterator[String] = non-empty iterator

scala> lines.size
res4: Int = 15

scala> lines.size
res5: Int = 0

scala> lines.hasNext
res6: Boolean = false

You see, when you call size a second time, the result is zero.
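
Applied to the loop in the question, this means the println already consumes the iterator, so the loop condition compares starti against 0. A minimal sketch of that effect, assuming an in-memory Source.fromString as a stand-in for the real file:

```scala
import scala.io.Source

// Assumption: three in-memory lines stand in for file.txt
val lines = Source.fromString("a\nb\nc").getLines()

val firstSize = lines.size   // walks the iterator to the end to count
val secondSize = lines.size  // nothing is left to count

// The question's loop condition (starti < lines.size) sees the second value
val loopWouldRun = 1 < secondSize

println("first: " + firstSize + ", second: " + secondSize + ", loop runs: " + loopWouldRun)
```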

There are two solutions: either force the iterator into something 'stable', such as lines.toSeq, or forget about size and do the "normal" iteration:

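while (lines.hasNext) {
  // Materialize the batch first: calling size on the bare iterator
  // returned by take would exhaust it before it could be processed.
  val linesSub = lines.take(10).toList
  println("batch:" + linesSub.size)
  // do something with linesSub
}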
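
For the first option, here is a rough sketch of the question's slice-based loop running over a strict collection. The 25 generated lines, Source.fromString, and the batches buffer are assumptions for illustration only; toVector is used here as one way of forcing the iterator.

```scala
import scala.collection.mutable.ListBuffer
import scala.io.Source

// Assumption: 25 generated lines stand in for file.txt
val text = (1 to 25).map(i => "line " + i).mkString("\n")

// Forcing the iterator once makes size and slice safe to call repeatedly
val lines = Source.fromString(text).getLines().toVector

val batches = ListBuffer.empty[Vector[String]]
var starti = 1 // as in the question, so the line at index 0 is skipped
while (starti < lines.size) {
  val nexti = math.min(starti + 10, lines.size)
  batches += lines.slice(starti, nexti)
  starti = nexti
}

println("batch sizes: " + batches.map(_.size).mkString(", "))
```

The trade-off is that the whole file is held in memory at once.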


None of the above answers quite hits the nail on the head.

There's a good reason why an Iterator is returned here. By being lazy, it takes pressure off the heap, and the String representing each line can then be garbage collected as soon as you've finished with it. In the case of large files, this can make all the difference for avoiding an OutOfMemoryError.

Ideally, you'd work directly with the iterator and not force it into a strict collection type.

Using grouped then, as per om-nom-nom's answer:

for (linesSub <- lines grouped 10) {
  //do something with linesSub
}

And if you wanted to retain the println counter, zip in an index:

for ( (linesSub, batchIdx) <- (lines grouped 10).zipWithIndex ) {
  println("batch " + batchIdx)
  //do something with linesSub
}

If you really need the total, invoke getLines twice, each time on a freshly opened Source (a Source is consumed along with its iterator): once for the count, and a second time to actually process the lines.
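
A sketch of the two-pass approach might look like the following. The temporary file and its 25 lines are made up so the example is self-contained, and closing each Source is an addition for tidiness rather than part of the original advice.

```scala
import java.nio.file.Files
import scala.io.Source

// Assumption: a temporary 25-line file stands in for file.txt
val path = Files.createTempFile("lines", ".txt")
Files.write(path, (1 to 25).map(i => "line " + i).mkString("\n").getBytes("UTF-8"))
val file = path.toFile

// Pass 1: open the file and consume the iterator only to count
val countSource = Source.fromFile(file)
val total = try countSource.getLines().size finally countSource.close()

// Pass 2: open the file again and process it lazily in batches of 10
val processSource = Source.fromFile(file)
val batchSizes =
  try processSource.getLines().grouped(10).map(_.size).toList
  finally processSource.close()

println("total: " + total + ", batch sizes: " + batchSizes.mkString(", "))
```

Reading the file twice costs extra I/O, but at no point is more than one batch of lines held in memory.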


The second time you call lines.size it returns 0. This is because lines is an iterator, not an array.


I've rewritten your code using a Seq, as proposed in @0__'s answer:

 import scala.io.Source

 val batchSize = 10
 // toSeq makes the lines safe to traverse more than once
 val lines = Source.fromFile("file.txt").getLines().toSeq

 println("total lines:" + lines.length)

 var processed = 0
 lines.grouped(batchSize).foreach { batch =>
   // only the final batch can be shorter than batchSize
   println("batch (" + processed + "," + (processed + batch.length) + ") total:" + lines.length)
   processed += batch.length
   // do something with batch
 }