开发者

Parallel tree traversal in C#

开发者 https://www.devze.com 2023-03-28 16:17 出处:网络
I need to traverse a tree quickly, and I would like to do it in parallel. I\'d rather use the parallel extensions than manually spin up a bunch of threads.

I need to traverse a tree quickly, and I would like to do it in parallel. I'd rather use the parallel extensions than manually spin up a bunch of threads.

My current code looks something like this:

   public void Traverse(Node root)
    {
        var nodeQueue = new Queue<Node>();
        nodeQueue.Enqueue(root);
        while (nodeQueue.Count!=0)
        {
            var node = nodeQueue.Dequeue();
            if (node.Property = someValue) DoSomething(node);
            foreach (var node in node.Children)
            {
                nodeQueue.Enqueue(node);
            }
        }
    }

I was really hoping that Parallel.ForEach had a Parallel.While analog. I came across Stephen Toub's article on Implementing Parallel While with Parallel.ForEach. If read it correctly this still won't work because I am mutating the queue I am trying to iterate.

Do I need to use a task factory and recursion (and is that risky?) ? or is there some simple solution I am overlooking?

Edit: @svick

The tree has just over 250,000 nodes. The maximum depth right now is 14 nodes de开发者_如何学编程ep including the root.

There are about 500 nodes off the root, and the balance after that has a fairly random distribution. I'll get some better stats on the distribution soon.

@Enigmativity:

Yes, the tree is being modified, concurrently by many users, but I will usually have a shared read lock for the tree or sub tree, or allow for dirty reads.

Calls to node.Children can be considered atomic.

DoSomething is really one of several delegates, for some expensive operations I will probably gather a snapshot list of nodes and process them outside the traversal.

I realized that I should probably look at the general case (a sub-tree being traversed instead of the entire tree.) To that end I ran traverse on every node of the tree and looked at the total time.

I used a Parallel.ForEach(nodes, Traverse) for each traversal algorithm, where nodes contained all ~250k nodes. This simulated (sort of) a lot of users simultaneously requesting a lot of different nodes.

00256ms Breadth First Sequential

00323ms Breadth First Sequential with work (i incremented a static counter as "work")

01495ms Kirks First answer

01143ms Svicks Second answer

00000ms Recursive Single Threaded didn't finish after 60s

00000ms Enigmativity's answer didn't finish after 60s

@Enigma, I think it's possible I might have messed up your alogrithm somehow, because it seems like it should be much quicker.

The results surprised me to say the least. I had to add some work to the breadth first sequential just to convince myself that the compiler wasn't magically optimizing away the traversals.

For the single traversal of the head, parallelizing the first level only had the best performance. But just barely, this number improved as I added more nodes to the second level (2000 instead of 500).


The most direct way would be to create a Task for each child node and then wait for all of them:

public void Traverse(Node root)
{
    if (node.Property == someValue)
        DoSomething(node);

    var tasks = new List<Task>();

    foreach (var node in node.Children)
    {
        // tmp is necessary because of the way closures close over loop variables
        var tmp = node;
        tasks.Add(Task.Factory.StartNew(() => Traverse(tmp)));
    }

    Task.WaitAll(tasks.ToArray());
}

Task is fairly light-weight, so creating lots of them works reasonably well. But they do have some overhead, so doing something more complicated like having a few tasks that share a queue is probably going to be faster. If that's the way you're going to go, don't forget that empty queue doesn't mean all work is done. Classes from the System.Collections.Concurrent namespace are going to come handy if you went this way.

EDIT: Because of the shape of the tree (the root has about 500 children), processing just the first level in parallel should give good performance:

public void Traverse(Node root, bool parallel = true)
{
    if (node.Property == someValue)
        DoSomething(node);

    if (parallel)
    {
        Parallel.ForEach(node.Children, node =>
        {
            Traverse(node, false);
        });
    }
    else
    {
        foreach (var node in node.Children)
        {
            Traverse(node, false);
        }
    }
}


I might be missing something, but I don't see the need for a while at all. The while is just ensuring that you iterate over every node.

Instead just call your function recursively for each node in the tree.

public void Traverse(Node root)
{         
    if (root.Property = someValue) DoSomething(node);    
    Parallel.ForEach<Node>(root.Children, node => Traverse(node));
} 

edit: of course the alternative, if you prefer to process horizontally rather than vertically and your expensive operation is DoSomething, is to do the Traverse first.

public IEnumerable<Node> Traverse(Node root)
{
    // return all the nodes on this level first, before recurring
    foreach (var node in root.Children)
    {
        if (node.Property == someValue)
            yield return node;
    }

    // next check children of each node
    foreach (var node in root.Children)
    {
        var children = Traverse(node);
        foreach (var child in children)
        {
            yield return child;
        }
    }
}

Parallel.ForEach<Node>(Traverse(n), n => DoSomething(n));


Since the traversal of the tree is extremely fast, that calls to Children are atomic, and that it is the expensive nature of the DoSomething delegates that need to be executed in parallel, here's my take on the solution.

I started with the idea that I needed a function that takes a node as a parameter, creates a task that executes DoSomething, recursively calls itself to create tasks for all of the children nodes, and finally returns a Task that waits for all of the internal tasks to be completed.

Here it is:

Func<Node, Task> createTask = null;
createTask = n =>
{
    var nt = Task.Factory.StartNew(() =>
    {
        if (n.Property == someValue)
            DoSomething(n);
    });
    var nts = (new [] { nt, })
        .Concat(n.Children.Select(cn => createTask(cn)))
        .ToArray();

    return Task.Factory.ContinueWhenAll(nts, ts => { });
};

All that is required to call it and wait for the traversal to complete is:

createTask(root).Wait();

I tested this by creating a tree of nodes with 500 children off of the root with 14 levels, with 1 or 2 subsequent children per node. This gave me a total of 319,501 nodes.

I created a DoSomething method that performed some work - for (var i = 0; i < 100000 ; i++) { }; - and then ran the above code and compared it to processing the same tree in series.

The parallel version took 5,151 ms. The sequential version 13,746 ms.

I also performed a test where I reduced the number of nodes to 3,196 and increased the processing time for DoSomething by 100x. The TPL very cleverly reverts to running sequentially if its tasks complete quickly so lengthening the processing time made the code run with more parallelism.

Now the parallel version took 3,203ms. The sequential version took 11,581ms. And, if I only called the createTask(root) function without waiting for it to complete it took just 126ms. This means that the tree is traversed very quickly, and it would then make sense to lock the tree during traversal and unlock it when processing is taking place.

I hope this helps.


Assuming you have p processors maybe you do a Parallel.For over root.Children with p partitions. Each of these would do the traditional single-thread traverse over the subtrees, compare, and, rather than DoSomething, would enqueue a delegate to DoSomething to a concurrent queue. If the distribution is basically random and balanced and since traversal only does traversal/enqueue, that portion takes 1/p th the time. Also, traversal would likely exhaust itself before all the DoSomethings would execute, so you could have p consumers (executors of DoSomething) giving you maximum parallel execution, assuming all these operations are independent.

With this naive partitioning across the number of root children with randomly distributed subtrees, traversal itself will be speedy. With your consumers roughly allocated per processor, you also get max parallel DoSomething action.


People often overcomplicate problems like this by being using an imperative code style, which leads to doing all the complicated work inside of the recursive function. Presumably, traversing your nodes is very fast, and you're hoping to get value from parallelizing calls to DoSomething, right?

Here's a simple and generic method you can use to get every node from any tree-like structure:

public IEnumerable<T> GetSelfAndDescendants<T>(T root, Func<T, IEnumerable<T>> getChildren)
{
    yield return root;
    foreach (var descendants in getChildren(root).SelectMany(child => GetSelfAndDescendants(child, getChildren)))
    {
        yield return descendants;
    }
}

With this in your tool belt, you can do something like this:

GetSelfAndDescendants(rootNode, n => n.Children)
    .Where(node => node.Property == someValue)
    .AsParallel()
    .ForAll(DoSomething);

This separates the problem of traversing the tree from the problem of processing its nodes, which means the recursive part of the process doesn't have to complicate the parallel part.

Also, since now there's only one entry point into the parallel library (the call to AsParallel() only happens once), you have a lot more control over how many tasks are executing concurrently. That's important because if you spin up ten thousand parallel threads they're all just going to be contending for resources and won't actually complete any faster. I suspect this is why the other answers had such poor performance: they're spinning up new threads at every level of recursion, so even if each Parallel.ForEach is limiting concurrency, there are dozens or hundreds of them getting invoked at the same time.

Using the approach I've detailed, you can either let .NET pick a sane default for you, or you can call .WithDegreeOfParallelism(...) to play with different numbers of concurrent tasks.

Based on my test cases, with DoSomething just running through a non-optimized for (var i = 0; i < 100000 ; i++) { }, this takes about one tenth as long as a simple foreach across the same collection.

I should note that if DoSomething doesn't lend itself to parallelism (e.g. it locks on shared resources), you may still find that it's faster to process the items serially.

DoSomething is really one of several delegates, for some expensive operations I will probably gather a snapshot list of nodes and process them outside the traversal.

This works well for that use case: just add a .ToList() in the appropriate spot to capture the nodes.

Update

Presumably, traversing your nodes is very fast, and you're hoping to get value from parallelizing calls to DoSomething, right?

No. In this case, it was the actual traversal that I was trying to optimize. This was for a system handling a lot of data near real time and we were trying to stay under 200ms for entire calls to the system not just the traversal.

In that case, it might be reasonable to follow svick's advice and just parallelize the top level of child nodes in your tree. If DoSomething() does practically nothing, then starting with a call to this method can boost your performance significantly:

public void Hybrid(Node root)
{
    if (root.Property == someValue) DoSomething(root);
    Parallel.ForEach(root.Children, Traverse);
}

Parallel tree traversal in C#

Benchmark: http://share.linqpad.net/gpc4pc.linq

Of course, that assumes that you can really benefit from multiple cores (i.e. your system isn't handling a bunch of other operations at the same time). If traversing your tree is really your performance bottleneck, and you find yourself traversing the tree often enough for it to matter, I'd consider whether your problem can be solved better with a different data structure and algorithm.


Here are two parallel versions of the Traverse method. The first is the simplest one. The ParallelTraverse invokes a synchronous delegate for the root node and its descendants (its children, its children's children etc). The iterator uses a Stack<T> instead of a Queue<T>, for reasons of memory efficiency. The Stack<T> will contain at any moment only a handful of nodes. On the contrary a Queue<T> would store more than a half of the total number of nodes. Regarding the various ways that a tree can be traversed, you can look at this question.

/// <summary>
/// Invokes a delegate for a node and all its descendants in parallel.
/// </summary>
public static ParallelLoopResult ParallelTraverse<TNode>(
    TNode root,
    ParallelOptions parallelOptions,
    Action<TNode, ParallelLoopState> body,
    Func<TNode, IReadOnlyList<TNode>> childrenSelector)
{
    ArgumentNullException.ThrowIfNull(parallelOptions);
    ArgumentNullException.ThrowIfNull(body);
    ArgumentNullException.ThrowIfNull(childrenSelector);
    IEnumerable<TNode> Iterator()
    {
        Stack<TNode> stack = new();
        stack.Push(root);
        while (stack.Count > 0)
        {
            TNode node = stack.Pop();
            yield return node;
            IReadOnlyList<TNode> children = childrenSelector(node);
            if (children is null) continue;
            for (int i = children.Count - 1; i >= 0; i--)
                stack.Push(children[i]);
        }
    }
    return Parallel.ForEach(Iterator(), parallelOptions, body);
}

Usage example:

ParallelOptions options = new() { MaxDegreeOfParallelism = Environment.ProcessorCount };
ParallelTraverse(root, options, (node, state) =>
{
    if (node.Property = someValue) DoSomething(node);
}, node => node.Children);

Only the body is parallelized. The childrenSelector is called sequentially (for one node at a time).

The second version has identical signature and features, and differs only in the behavior. The ParallelTraverseHierarchical ensures that the body of all nodes will not be invoked before the completion of the body of their parents. This can be important in case, for example, that the nodes must be saved in a database, and saving a node requires the autogenerated ID of its parent.

/// <summary>
/// Invokes a delegate for a node and all its descendants in parallel.
/// Children are invoked after the completion of their parent.
/// </summary>
public static ParallelLoopResult ParallelTraverseHierarchical<TNode>(
    TNode root,
    ParallelOptions parallelOptions,
    Action<TNode, ParallelLoopState> body,
    Func<TNode, IReadOnlyList<TNode>> childrenSelector)
{
    ArgumentNullException.ThrowIfNull(parallelOptions);
    ArgumentNullException.ThrowIfNull(body);
    ArgumentNullException.ThrowIfNull(childrenSelector);
    using BlockingCollection<TNode> stack = new(new ConcurrentStack<TNode>());
    stack.Add(root);
    int pendingCount = stack.Count;

    Partitioner<TNode> partitioner = Partitioner.Create(
        stack.GetConsumingEnumerable(), EnumerablePartitionerOptions.NoBuffering);

    return Parallel.ForEach(partitioner, parallelOptions, (node, state) =>
    {
        try
        {
            body(node, state);
            IReadOnlyList<TNode> children = childrenSelector(node);
            if (children is null) return;
            Interlocked.Add(ref pendingCount, children.Count);
            // Ideally the children should be added all at once as an atomic
            // operation, but although the ConcurrentStack does have an atomic
            // PushRange method, the BlockingCollection doesn't expose this feature.
            for (int i = children.Count - 1; i >= 0; i--)
                stack.Add(children[i]);
        }
        finally
        {
            if (Interlocked.Decrement(ref pendingCount) == 0)
                stack.CompleteAdding();
        }
    });
}

This version parallelizes both the body and the childrenSelector. One of the two delegates is expected to be chunky. In case both are feather-weight (in the magnitude of microseconds), the overhead of passing every individual node through the BlockingCollection<T> might negate the benefits of parallelization. I have uploaded here a more sophisticated implementation of the ParallelTraverseHierarchical method, based on a stack of stacks, that can traverse millions of nodes per second.

The scenario mentioned above, saving a tree in the database, requires that the TNode type has a mutable property for the database Id, and also that it has a reference to its parent node. This way each node will be able to retrieve the autogenerated ID of its parent. In case this requirement is not easy to meet, the above implementation could be enhanced by changing the type of the body parameter from Action<TNode, ParallelLoopState> to Func<TNode, TArg, ParallelLoopState, TArg>. The TArg would be a generic parameter that would pass information to the children about their parent. This information could be transferred through the blocking collection: BlockingCollection<(TNode, TArg)>. The TArg for the root node could be passed as an extra parameter in the method.


Perhaps using a List or Array instead of queue would help. Also use another List/Array to populate the next nodes to visit. You won't be processing list that until you finish the entire width first anyway. Something like this:

List<Node> todoList = new List<Node>();
todoList.Add(node);
while (todoList.Count > 0)
{
    // we'll be adding next nodes to process to this list so it needs to be thread-safe
    // or just sync access to a non-threadsafe list
    // if you know approx how many nodes you expect, you can pre-size the list
    ThreadSafeList<Node> nextList = new ThreadSafeList<Node>();  

    //todoList is readonly/static so can cache Count in simple variable
    int maxIndex  =  todoList.Count-1;
    // process todoList in parallel
    Parallel.For(0, maxIndex, i =>
    {
        // if list reads are thread-safe then no need to sync, otherwise sync
        Node x = todoList[i];

        //process x;
        // e.g. do somehting, get childrenNodesToWorkOnNext, etc.

        // add any child nodes that need to be processed next
        // e.g. nextList.add(childrenNodesToWorkOnNext);
    });

   // done with parallel processing by here so use the next todo list
   todoList = nextList;
)
0

精彩评论

暂无评论...
验证码 换一张
取 消