开发者

F# lists with sequence operators

开发者 https://www.devze.com 2023-04-05 03:37 出处:网络
After having a look at these two threads: Does F# have an equivalent to Haskell's take? , Take N elements from sequence with N different indexes in F#

After having a look at these two threads: Does F# have an equivalent to Haskell's take? , Take N elements from sequence with N different indexes in F# , I've been wondering about the best way to use sequence operators on lists or even if using them.

I am at the moment a newbie at F# and I'm writing a program which has to deal with lots of sequences that I get from HtmlAgilityPack. There are some interesting operators in the Seq module but as stated in those threads, it might be poor related to performance and if we are obliged to constantly converting between seq -> list it also clutters code with things that are not problem-solving... which is why I started learning F# in the first place.

A simple example is when I need to take 'N' elements of a list:

               listOfRows
               |> Seq.take 2
               // Now I don't have a list anymore, it returns a sequence
               |> List.ofSeq开发者_开发知识库

So, could anyone shed some light about the best way to deal with these scenarios? I can work solutions using Seq.take and Seq.skip but this is known to be very inefficient. On the other hand it's a shame to have the functionality built into the standard library and having to reimplement it for using the same concept on a different collection, or make the code dirtier with explicit conversions.

How could I see the performance impact each conversion between 'list -> seq' and 'seq -> list' has?

Thanks a lot.


These are fairly trivial to implement.

module List =

  let take n items =
    let rec take' acc = function
      | 0, _ -> List.rev acc
      | _, [] -> invalidOp "count exceeds number of elements"
      | n, h::t -> take' (h::acc) (n-1, t)
    take' [] (n, items)

  let rec skip n items =
    match n, items with
    | 0, _ -> items
    | _, [] -> invalidOp "count exceeds number of elements"
    | n, _::t -> skip (n-1) t

And here's how they perform vs their Seq counterparts.

let n = 10000000
let l = List.init n id
let test f = f (n-1) l

test List.take               //Real: 00:00:03.724, CPU: 00:00:03.822, GC gen0: 57, gen1: 34, gen2: 1
test Seq.take |> Seq.toList  //Real: 00:00:04.953, CPU: 00:00:04.898, GC gen0: 57, gen1: 33, gen2: 0
test List.skip               //Real: 00:00:00.044, CPU: 00:00:00.046, GC gen0: 0, gen1: 0, gen2: 0
test Seq.skip |> Seq.toList  //Real: 00:00:01.147, CPU: 00:00:01.154, GC gen0: 0, gen1: 0, gen2: 0

If milliseconds count for your application then maybe it's worth it to create "missing" List functions. Otherwise, I would say using the Seq versions is perfectly fine.


Some of this might depend on exactly how you want to use all this end-to-end.

In many cases it will be fine to convert to a list once up front, and then only use the List operators to map/traverse/etc. There may not be a List.take, but that's because with lists, if you know there are going to be at least two elements and you want to grab those two, you can do it with a pattern match, e.g.

let (item1::item2::rest) = someList

So I suspect that may be what you want to do in this scenario (I expect with HTML parsing, you might have some kind of expected rough schema of elements you're looking for, etc.).

(If laziness/streaming is essential, then Seq becomes much more useful.)

Briefly, the commonest operators (like map) are on all the collection types (Seq, List, Array, ...) whereas the uncommoner ones (like take) are only available on Seq, often because there are better ways to do things when you have a concrete type (e.g. list pattern matching to take the first items).


To add a comment

In a purely functional sense take cannot operate on the list in place - consider

a::b::c::d::[]

if we want only the first 2 elements, we need to at the very least modify b so that we get

a::b::[]

Since b was modified, you will also need to modify a so that it points to the new modified b. As a result of this, it is impossible to implement take in place on a list, which explains why it is missing from the List module.

If you are really worried about performance, profile first, and then consider switching to a different data type. You may be better off using the .Net System.Collections.Generic.List<_> which has many of the same methods as List and Array - http://research.microsoft.com/en-us/um/cambridge/projects/fsharp/manual/fsharp.powerpack/microsoft.fsharp.collections.resizearray.html


You may fully understand performance impact of conversions Seq -> List and List -> Seq by inspecting correspondent conversion implementations:

// List.ofSeq
let ofSeq source = Seq.toList source
// List.toSeq
let toSeq list = Seq.ofList list
// Seq.ofList
let ofList (source : 'T list) = 
        (source :> seq<'T>)
// Seq.toList
let toList (source : seq<'T>) = 
        checkNonNull "source" source
        match source with 
        | :? ('T list) as res -> res
        | :? ('T[]) as res -> List.ofArray res
        | _ -> 
            use e = source.GetEnumerator()
            let mutable res = [] 
            while e.MoveNext() do
                res <- e.Current :: res
            List.rev res

Conversions themself are relatively light on performance when compared with actual operations on collections. Running the following snippet that converts a List of 1 million members to seq and then back to another List on my old Core 2 Duo 2.4Ghz notebook

open System.Diagnostics

let tls = Stopwatch()
let l = [1..1000000]
tls.Start()
let s = List.toSeq l
//Seq.length s |> ignore
//Seq.length s |> ignore
tls.Stop()
printfn "List<int> of 1000000 -> Seq: %d ticks" tls.ElapsedTicks 

let tsl = Stopwatch()
tsl.Start()
let l' = Seq.toList s
//l'.Length |> ignore
//l'.Length |> ignore
tsl.Stop()
printfn "Seq<int> of 1000000 -> List: %d ticks" tsl.ElapsedTicks

shows blasting 42 and 8 ticks correspondently. If we uncomment first respective lines with length counters execution will take 18695 and 12952 ticks. After uncommenting second respective lines with length counters execution duration is showing 38377 and 25404 ticks, which indicates that laziness is unrelated to observed fenomena.

It seems that overhead of conversions between Seq and List types may be negligible compared to Collections operations execution per se.


List to Seq is nothing but creating a iterator (in .net world a Enumerable) on the List, so basically it is not a operation which will cause much of a performance issue (it just create a state machine which hold the state about which is the current element in the list that need to be 'yield' and the increment it as more elements are requested). On the other hand converting a Seq (which will have some underlying collection from which it produces values) to a List is conceptually same as iterating a list and creating a new list from it so it may be a time and memory consuming process in case the list is long enough.

As far as usage of these operator goes one possible approach would be to group all your sequence operator together (same as linq queries where your sort of create a pipeline through which the collection elements gets processed one by one) and then in the end if you need you can create the list from resulting Seq as the list is created at the end of all the filtering, mapping, take etc work on seq and when the final data is ready you convert it to List. Creating intermediate lists will not work out well and will cause performance issues.

0

精彩评论

暂无评论...
验证码 换一张
取 消