开发者

Dealing with the surprising lack of ParList in scala.collections.parallel

开发者 https://www.devze.com 2023-03-19 17:57 出处:网络
So scala 2.9 recen开发者_运维百科tly turned up in Debian testing, bringing the newfangled parallel collections with it.

So scala 2.9 recen开发者_运维百科tly turned up in Debian testing, bringing the newfangled parallel collections with it.

Suppose I have some code equivalent to

  def expensiveFunction(x:Int):Int = {...}

  def process(s:List[Int]):List[Int} = s.map(expensiveFunction)

now from the teeny bit I'd gleaned about parallel collections before the docs actually turned up on my machine, I was expecting to parallelize this just by switching the List to a ParList... but to my surprise, there isn't one! (Just ParVector, ParMap, ParSet...).

As a workround, this (or a one-line equivalent) seems to work well enough:

  def process(s:List[Int]):List[Int} = {
    val ps=scala.collection.parallel.immutable.ParVector()++s
    val pr=ps.map(expensiveFunction)
    List()++pr
  }

yielding an approximately x3 performance improvement in my test code and achieving massively higher CPU usage (quad core plus hyperthreading i7). But it seems kind of clunky.

My question is a sort of an aggregated:

  • Why isn't there a ParList ?
  • Given there isn't a ParList, is there a better pattern/idiom I should adopt so that I don't feel like they're missing ?
  • Am I just "behind the times" using Lists a lot in my scala programs (like all the Scala books I bought back in the 2.7 days taught me) and I should actually be making more use of Vectors ? (I mean in C++ land I'd generally need a pretty good reason to use std::list over std::vector).


Lists are great when you want pattern matching (i.e. case x :: xs) and for efficient prepending/iteration. However, they are not so great when you want fast access-by-index, or splitting into chunks, or joining (i.e. xs ::: ys).

Hence it does not make much sense (to have a parallel List) when you think that this kind of thing (splitting and joining) is exactly what is needed for efficient parallelism. Use:

xs.toIndexedSeq.par


First, let me show you how to make a parallel version of that code:

def expensiveFunction(x:Int):Int = {...}

def process(s:List[Int]):Seq[Int] = s.par.map(expensiveFunction).seq

That will have Scala figure things out for you -- and, by the way, it uses ParVector. If you really want List, call .toList instead of .seq.

As for the questions:

  • There isn't a ParList because a List is an intrinsically non-parallel data structure, because any operation on it requires traversal.

  • You should code to traits instead of classes -- Seq, ParSeq and GenSeq, for example. Even performance characteristics of a List are guaranteed by LinearSeq.

  • All the books before Scala 2.8 did not have the new collections library in mind. In particular, the collections really didn't share a consistent and complete API. Now they do, and you'll gain much by taking advantage of it.

    Furthermore, there wasn't a collection like Vector in Scala 2.7 -- an immutable collection with (near) constant indexed access.


A List cannot be easily split into various sub-lists which makes it hard to parallelise. For one, it has O(n) access; also a List cannot strip its tail, so one need to include a length parameter.

I guess, taking a Vector will be the better solution.

Note that Scala’s Vector is different from std::vector. The latter is basically a wrapper around standard array, a contiguous block in memory which needs to be copied every now and then when adding or removing data. Scala’s Vector is a specialised data structure which allows for efficient copying and splitting while keeping the data itself immutable.

0

精彩评论

暂无评论...
验证码 换一张
取 消