I am working on an application that processes a large amount of text data, gathering statistics on word occurrences (see: Source Code Word Cloud).
Here is what the simplified core of my code does.
- Enumerate through all files with the *.txt extension.
- Enumerate through the words in each text file.
- Group by word and count occurrences.
- Sort by occurrences.
- Output top 20.
Everything worked fine with LINQ. Moving to PLINQ brought me a significant performance boost. But ... the ability to cancel long-running queries is lost.
It seems that the OrderBy query synchronizes data back onto the main thread, so Windows messages are not processed.
In the example below I demonstrate my implementation of cancellation according to the MSDN article "How to: Cancel a PLINQ Query", which does not work :(
Any other ideas?
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using System.Threading;
using System.Windows.Forms;

namespace PlinqCancelability
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
            m_CancellationTokenSource = new CancellationTokenSource();
        }

        private readonly CancellationTokenSource m_CancellationTokenSource;

        private void buttonStart_Click(object sender, EventArgs e)
        {
            var result = Directory
                .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
                .AsParallel()
                .WithCancellation(m_CancellationTokenSource.Token)
                .SelectMany(File.ReadLines)
                .SelectMany(ReadWords)
                .GroupBy(word => word, (word, words) => new Tuple<int, string>(words.Count(), word))
                .OrderByDescending(occurrencesWordPair => occurrencesWordPair.Item1)
                .Take(20);
            try
            {
                foreach (Tuple<int, string> tuple in result)
                {
                    Console.WriteLine(tuple);
                }
            }
            catch (OperationCanceledException ex)
            {
                Console.WriteLine(ex.Message);
            }
        }

        private void buttonCancel_Click(object sender, EventArgs e)
        {
            m_CancellationTokenSource.Cancel();
        }

        private static IEnumerable<string> ReadWords(string line)
        {
            StringBuilder word = new StringBuilder();
            foreach (char ch in line)
            {
                if (char.IsLetter(ch))
                {
                    word.Append(ch);
                }
                else
                {
                    // Skip consecutive separators; only emit completed words.
                    if (word.Length == 0) continue;
                    yield return word.ToString();
                    word.Clear();
                }
            }
            // Emit the last word when the line does not end with a separator.
            if (word.Length != 0) yield return word.ToString();
        }
    }
}
As Jon said, you'll need to start the PLINQ operation on a background thread. This way, the user interface doesn't hang while waiting for the operation to complete, so the event handler for the Cancel button can be invoked and the Cancel method of the cancellation token source gets called. The PLINQ query cancels itself automatically when the token is cancelled, so you don't need to worry about that.
Here is one way to do this:
private void buttonStart_Click(object sender, EventArgs e)
{
    // Starts a task that runs the operation (on a background thread).
    // Requires 'using System.Threading.Tasks;'.
    // Note: I added 'ToList' so that the result is actually evaluated
    // and all results are stored in an in-memory data structure.
    var task = Task.Factory.StartNew(() =>
        Directory
            .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
            .AsParallel()
            .WithCancellation(m_CancellationTokenSource.Token)
            .SelectMany(File.ReadLines)
            .SelectMany(ReadWords)
            .GroupBy(word => word, (word, words) =>
                new Tuple<int, string>(words.Count(), word))
            .OrderByDescending(occurrencesWordPair => occurrencesWordPair.Item1)
            .Take(20).ToList(), m_CancellationTokenSource.Token);

    // Specify what happens when the task completes.
    // Use 'this.Invoke' to specify that the operation happens on the GUI thread
    // (where you can safely access GUI elements of your WinForms app).
    task.ContinueWith(res => {
        this.Invoke(new Action(() => {
            try
            {
                foreach (Tuple<int, string> tuple in res.Result)
                {
                    Console.WriteLine(tuple);
                }
            }
            catch (AggregateException ex)
            {
                // 'Result' wraps a cancellation (or any other failure)
                // in an AggregateException.
                Console.WriteLine(ex.InnerException.Message);
            }
        }));
    });
}
You're currently iterating over the query results in the UI thread. Even though the query is executing in parallel, you're still iterating over the results in the UI thread. That means the UI thread is too busy performing computations (or waiting for the query to get results from its other threads) to respond to the click on the "Cancel" button.
You need to punt the work of iterating over the query results onto a background thread.
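A minimal console sketch of that idea, under a few assumptions: `TopWords` and `SlowLines` are hypothetical names, the in-memory lines stand in for the files under c:\temp, and `Main` plays the part of the UI thread. The point is only that the query and the enumeration of its results (`ToList`) both run on a pool thread, so the calling thread stays free to call `Cancel`.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BackgroundQuerySketch
{
    // Hypothetical stand-in for the file-based query: counts words in in-memory lines.
    public static List<Tuple<int, string>> TopWords(
        IEnumerable<string> lines, int count, CancellationToken token)
    {
        return lines
            .AsParallel()
            .WithCancellation(token)
            .SelectMany(line => line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
            .GroupBy(word => word, (word, words) => Tuple.Create(words.Count(), word))
            .OrderByDescending(pair => pair.Item1)
            .Take(count)
            .ToList(); // forces evaluation on whichever thread calls TopWords
    }

    // Hypothetical slow, endless source, so the query can only end through cancellation.
    private static IEnumerable<string> SlowLines()
    {
        while (true)
        {
            Thread.Sleep(1);
            yield return "to be or not to be";
        }
    }

    public static void Main()
    {
        using (var cts = new CancellationTokenSource())
        {
            // The whole query, including enumeration of its results, runs on a pool thread.
            Task<List<Tuple<int, string>>> task =
                Task.Run(() => TopWords(SlowLines(), 20, cts.Token), cts.Token);

            // The calling thread (the UI thread in a WinForms app) is not blocked,
            // so it could handle a Cancel click; here we simply cancel after a delay.
            Thread.Sleep(200);
            cts.Cancel();

            try
            {
                task.Wait();
            }
            catch (AggregateException ex)
            {
                // Waiting on a cancelled or failed task wraps the cause in an AggregateException.
                Console.WriteLine(ex.InnerException.GetType().Name);
            }
        }
    }
}
```

PLINQ polls the token at intervals rather than on every element, so the query may take a moment to stop after `Cancel` is called. In a real WinForms app you would not block in `Wait` on the UI thread; a continuation that marshals back to the form is the usual replacement.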
I think I found an elegant solution that fits better with the LINQ / PLINQ concept.
I declare an extension method.
public static class ProcessWindowsMessagesExtension
{
    public static ParallelQuery<TSource> DoEvents<TSource>(this ParallelQuery<TSource> source)
    {
        return source.Select(
            item =>
            {
                Application.DoEvents();
                Thread.Yield();
                return item;
            });
    }
}
And then I add it to my query wherever I want it to stay responsive.
var result = Directory
    .EnumerateFiles(@"c:\temp", "*.txt", SearchOption.AllDirectories)
    .AsParallel()
    .WithCancellation(m_CancellationTokenSource.Token)
    .SelectMany(File.ReadLines)
    .DoEvents()
    .SelectMany(ReadWords)
    .GroupBy(word => word, (word, words) => new Tuple<int, string>(words.Count(), word))
    .OrderByDescending(occurrencesWordPair => occurrencesWordPair.Item1)
    .Take(20);
It works fine!
See my post on it for more info and source code to play with: “Cancel me if you can” or PLINQ cancelability & responsiveness in WinForms