
Using Scala to cut up a large CSV file

https://www.devze.com 2023-01-14 09:38 Source: Internet

What's the best way to do file IO in Scala 2.8?

All I want to do is cut a massive CSV file into lots of smaller ones with, say, 1000 lines of data per file, and each file retaining the header.


For simple tasks like this I would use scala.io.Source. An example would look like this:

val linesPerFile = 1000
val input = scala.io.Source.fromFile("input.csv").getLines()

if (input.hasNext) {
  // assuming one header line
  val header = List(input.next())

  for ((i, lines) <- Iterator.from(1) zip input.grouped(linesPerFile)) {
    val out = new java.io.PrintWriter(s"output-$i.csv") // create a file for chunk i
    (header.iterator ++ lines.iterator).foreach(out.println)
    out.close()
  }
}


Moritz' answer is good, provided you don't run into some of CSV's more annoying corner cases. A relevant example would be CSV data where one column is a string that might contain line breaks: you can't rely on a row being on a single line, or you'll end up cutting some rows in half.
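To make the corner case concrete, here is a minimal sketch of grouping physical lines into logical CSV records before splitting. The logicalRecords helper is hypothetical and only tracks unclosed quotes, so it covers embedded newlines but not every RFC 4180 subtlety; a real parsing library remains the safer choice.

```scala
// Hypothetical helper: a record is complete once it contains an even
// number of double quotes; an odd count means a quoted field is still
// open and the record continues on the next physical line.
def logicalRecords(lines: Iterator[String]): Iterator[String] =
  new Iterator[String] {
    def hasNext: Boolean = lines.hasNext
    def next(): String = {
      val sb = new StringBuilder(lines.next())
      while (sb.count(_ == '"') % 2 != 0 && lines.hasNext)
        sb.append("\n").append(lines.next())
      sb.toString
    }
  }

// A quoted field containing a newline spans two physical lines:
val physical = Iterator("id,note", "1,\"line one", "line two\"", "2,plain")
val records  = logicalRecords(physical).toList
// records == List("id,note", "1,\"line one\nline two\"", "2,plain")
```

Splitting on logicalRecords instead of raw getLines() keeps such rows intact when you group them into chunks.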

I'd use a dedicated CSV parsing library to turn your data into an iterator. kantan.csv is an example (I'm the author), but there are other alternatives such as product-collections or opencsv.

