Options for compute bound parallel processing on lists in F#_问答_开发者

async{} workflow is easy to use and gives good results on asynchronous operations like IO. Is this a good option also for purely data bound operations? There's also the .NET ThreadPool and BackgroundWorker, are these better options for data bound computations?

The example I have is a large list, about 10,000 strings all unique. For every item I need to compare the string along with all the strings that come after it in the list. The return type will be a list<string * list<string>> containing each of the original items along with a list of possible solutions (possible strin开发者_如何学运维g matches, based on some application specific algorithm).

My original thought was to use an ansyc{} workflow, but I'm unsure if this problem would be more suited to other parallel techniques, or even if it should be paralised.

Thre are a few options, but the first question is whether you want to use task-based or data-parallel solution. I recently wrote a few articles about parallel programming in F#, so you may find a few articles from the series useful.

Data-parallel. If you have a purely CPU-bound operation that is data-parallel, then the best options are

Use the PSeq module from F# PowerPack if you have some more complicated declarative data processing. You can find an introduction in my article Using PLINQ and Tasks from F#.
Use Array.Parallel.map function from the standard F# librar if you need just a simple map operation.

Task-based CPU-bound. If data-parallel doesn't fit your problem very well, then you can use .NET 4.0 Tasks or F# asynchronous workflows and the StartAsChild member. Tasks do not support cancellation and look a bit ugly in F#, but they are probably more optimized (if you want to create a really large number of them). On the other hand, F# asynchronous workflows look more elegant in F#. I wrote two articles that compare the options. The first one explains basic features and the second one talks about cancellation.

There is also message passing using F# agents, which may work quite well if you have something that requires complicated coordination (and would be otherwise written using a lot of synchronization primitives and mutable state)

Your problem is CPU intensive since you have all your data in memory and the main CPU cost is the your customized string matching algorithm.

You can use Parallel.For, which is a little bit faster than Async for CPU bounded parallel tasks.

async{} workflow is easy to use and gives good results on asynchronous operations like IO. Is this a good option also for purely data bound operations?

No. Async is for concurrent IO which is completely different from your problem which is parallel CPU-bound computation.

There's also the .NET ThreadPool and BackgroundWorker, are these better options for data bound computations?

Slightly better but they are still not built for this.

list<(string * list)

That is not a valid type.

My original thought was to use an ansyc{} workflow,

Bad idea.

but I'm unsure if this problem would be more suited to other parallel techniques, or even if it should be paralised.

Write a correct serial version first and, if it is too slow, optimize it and maybe parallelize it. You are unlikely to see significant gains from parallelism while you are still using linked lists, for example.

The algorithm you describe can be implemented as follows in F#:

let rec f p a = function
  | [] -> List.rev a
  | x::xs -> f p ((x, List.filter (p x) xs)::a) xs

where p is the arbitrary predicate function your algorithm is generic over. For example, the following finds all subsequent strings equal to each head string:

> f (=) [] ["a"; "b"; "c"; "a"; "c"];;
val it : (string * string list) list =
  [("a", ["a"]); ("b", []); ("c", ["c"]); ("a", []); ("c", [])]

Running this on 10,000 strings on my netbook takes under 6s. Is that fast enough?

Async is intended for both parallel CPU-bound operations and asynchronous IO operations and works well for both. Don Syme has a blog post about this. The operations will still run on the thread pool so you're not giving that up by using async.