I have a Posrgres 9.04 database table with over 12,000,000 rows.
I need a program to read each row, do some calculations and lookups (against a 2nd table), then write a new row in a 3rd table with the results of these calculations. When done, the 3rd table will have the same number of rows as the 1st table.
Executing serially on a Core i7 720QM processor takes more than 24 hours. It only taxes one of my 8 cores (4 physical cores, but 8 visible to Windows 7 via HTT).
I want to speed this up with parallelism. I thought I could use PLINQ and Npgsql:
NpgsqlDataReader records = new NpgsqlCommand("SELECT * FROM table", conn).ExecuteReader();
var single_record = from row in records.AsParallel()
select row;
However, I get an error for records.AsParallel()
: Could not find an implementation of the query pattern for source type 'System.Linq.ParallelQuery'. 'Select' not found. Consider explicitly specifying the type of the range variable 'row'.
I've done a lot of Google searches, and I'm just coming up more confused. NpgsqlDataReader inherits from System.Data.Common.开发者_如何学编程DbDataReader, which in turn implements IEnumerable, which has the AsParallel extension, so seems like the right stuff is in place to get this working?
It's not clear to me what I could even do to explicitly specify the type of the range variable. It appears that best practice is not to specify this.
I am open to switching to a DataSet, presuming that's PLINQ compatible, but would rather avoid if possible because of the 12,000,000 rows.
Is this even something achievable with Npgsql? Do I need to use Devart's dotConnect for PostgreSQL instead?
UPDATE: Just found http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae, which led me to try this:
foreach(IDataRecord arrest in
from row in arrests.AsParallel().Cast <IDataRecord>()
select row)
So far no errors in the IDE, but is this a proper way of constructing this?
This is indeed the solution:
foreach(IDataRecord arrest in
from row in arrests.AsParallel().Cast <IDataRecord>()
select row)
This solution was inspired by what I found at http://social.msdn.microsoft.com/Forums/en-US/parallelextensions/thread/2f5ce226-c500-4899-a923-99285ace42ae#1956768e-9403-4671-a196-8dfb3d7070e3. It's not clear to me why the cast and type specification is needed, but it works.
EDIT: While this doesn't cause syntax or runtime errors, it in fact does not make things run in parallel. Everything is still serialized. See PLINQ on ConcurrentQueue isn't multithreading for a superior solution.
You should consider using Greenplum. It's trivial to accomplish this in a Greenplum database. The free version isn't gimped in any way and it's postgresql at its core.
精彩评论