I have a DLL that contains some language-processing classes and methods. One of these methods takes a word as an argument, runs a calculation that takes about 3 seconds, and saves the result to a SQL Server database.
I want to run this DLL method on 900k words, and the job may repeat every week. How can I easily distribute this work across multiple systems to save time, using C#?
Answer in the form: Requirement -- Tool
Scheduled Runs -- Quartz.NET
Quartz allows you to run "jobs" on any given schedule. It also maintains state between runs, so if the server goes down for some reason, it knows to resume the job when it comes back up. Pretty cool stuff.
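For example, here's a minimal sketch of a weekly run using the Quartz.NET 2.x fluent API; WordProcessingJob and the cron schedule are placeholders for your own job and timing:

```csharp
using Quartz;
using Quartz.Impl;

// The job Quartz fires on schedule; it kicks off the weekly run.
public class WordProcessingJob : IJob
{
    public void Execute(IJobExecutionContext context)
    {
        // Enqueue the 900k words for the workers (see the queue section below).
    }
}

public static class SchedulerSetup
{
    public static void Start()
    {
        var scheduler = new StdSchedulerFactory().GetScheduler();
        scheduler.Start();

        var job = JobBuilder.Create<WordProcessingJob>()
            .WithIdentity("weekly-word-run")
            .Build();

        var trigger = TriggerBuilder.Create()
            .WithCronSchedule("0 0 2 ? * SUN") // every Sunday at 02:00
            .Build();

        scheduler.ScheduleJob(job, trigger);
    }
}
```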
Distributed Queue -- NServiceBus
A good service bus is worth its weight in gold. Basically, you want all your workers pulling operations off a shared queue, each handling one at a time, however many operations are waiting. If you ensure your operations are idempotent, NServiceBus is a great way to accomplish this (a worker sketch follows the diagram below).
Queue -> Worker 1 + Worker 2 + Worker 3 -> Local Data Storage -> Data Queue + Workers -> Remote Data Storage
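Here's a minimal sketch of the worker side, assuming the classic synchronous IHandleMessages&lt;T&gt; API from older NServiceBus versions; ProcessWordCommand and LanguageProcessor.Process are hypothetical stand-ins for your message type and your DLL method:

```csharp
using NServiceBus;

// The unit of work each worker pulls off the queue.
public class ProcessWordCommand : IMessage
{
    public string Word { get; set; }
}

// Run one of these endpoints per worker machine; the bus distributes
// queued commands across every running instance.
public class ProcessWordHandler : IHandleMessages<ProcessWordCommand>
{
    public void Handle(ProcessWordCommand message)
    {
        // Must be idempotent: a redelivered message should be harmless.
        LanguageProcessor.Process(message.Word);
    }
}
```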
Data Cache -- RavenDB or SQLite
Basically, to keep the return values of these operations sufficiently isolated from the SQL Server, you want to cache each value in a local storage system first. That could be something fast and non-relational like RavenDB, or something structured like SQLite. You'd then push an identifier onto another queue via NServiceBus and sync the record to SQL Server from there. Queues are your friend! :-)
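A sketch of the local staging step using System.Data.SQLite; the Results table and the commented-out SyncResultCommand message are hypothetical:

```csharp
using System.Data.SQLite;

public static class LocalStore
{
    // Stage a result locally; the sync workers move it to SQL Server later.
    public static long Save(string word, double score)
    {
        using (var conn = new SQLiteConnection("Data Source=results.db"))
        {
            conn.Open();

            using (var create = new SQLiteCommand(
                "CREATE TABLE IF NOT EXISTS Results (Id INTEGER PRIMARY KEY, Word TEXT, Score REAL)",
                conn))
            {
                create.ExecuteNonQuery();
            }

            using (var insert = new SQLiteCommand(
                "INSERT INTO Results (Word, Score) VALUES (@word, @score)", conn))
            {
                insert.Parameters.AddWithValue("@word", word);
                insert.Parameters.AddWithValue("@score", score);
                insert.ExecuteNonQuery();
            }

            // Bus.Send(new SyncResultCommand { ResultId = conn.LastInsertRowId });
            return conn.LastInsertRowId;
        }
    }
}
```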
Async Operations -- Task Parallel Library and TPL DataFlow
You essentially want to ensure that none of your operations block and that each is sufficiently atomic. If you don't know the TPL already, you should; it's some really powerful stuff! I hear this a lot from Java folks, but it's worth mentioning: C# is becoming a really great language for async and parallel workflows!
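On any one machine, a sketch like this would saturate every logical processor; LanguageProcessor.Process again stands in for your DLL method, and this assumes it's safe to call from multiple threads:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class SliceProcessor
{
    // Max out every core with this machine's slice of the word list.
    public static void ProcessSlice(IEnumerable<string> wordsForThisMachine)
    {
        Parallel.ForEach(
            wordsForThisMachine,
            new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount },
            word => LanguageProcessor.Process(word));
    }
}
```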
Also, one cool thing coming out of the new Async CTP is TPL Dataflow. I haven't used it, but it seems to be right up your alley!
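As a taste of what it looks like, here's a sketch using an ActionBlock, assuming the System.Threading.Tasks.Dataflow API; BoundedCapacity gives you back-pressure so the 900k words never sit in memory all at once:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks.Dataflow;

public static class DataflowProcessor
{
    // A bounded, parallel pipeline stage: words stream through it.
    public static void Process(IEnumerable<string> words)
    {
        var processor = new ActionBlock<string>(
            word => LanguageProcessor.Process(word),
            new ExecutionDataflowBlockOptions
            {
                MaxDegreeOfParallelism = Environment.ProcessorCount,
                BoundedCapacity = 1000 // back-pressure: keeps memory flat
            });

        foreach (var word in words)
            processor.SendAsync(word).Wait(); // waits only when the buffer is full

        processor.Complete();
        processor.Completion.Wait();
    }
}
```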
Since it's existing code, I would look for a way to split that list of 900k words. Everything else would require far bigger changes.
I think this is addressed by DryadLINQ. I only know of it, no hands-on experience myself, but it sounds like it fits the bill.
GJ
You could create an application that acts as server software. It would manage the list of words and distribute them to the clients. Your client software would be installed on the distributed PCs. You could then use MSMQ as a quick way to communicate back and forth.
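A sketch of the server side using System.Messaging; the queue path and the commented-out client snippet are illustrative, and remote clients would open the queue via the server's path:

```csharp
using System.Collections.Generic;
using System.Messaging;

public static class WordServer
{
    // Server side: push every word onto a queue for the clients to drain.
    public static void Enqueue(IEnumerable<string> words)
    {
        const string queuePath = @".\Private$\words";

        if (!MessageQueue.Exists(queuePath))
            MessageQueue.Create(queuePath);

        using (var queue = new MessageQueue(queuePath))
        {
            foreach (var word in words)
                queue.Send(word, word); // body and label
        }
    }
}

// Client side, on each distributed PC (points at the server's queue):
// using (var queue = new MessageQueue(@"FormatName:DIRECT=OS:server\Private$\words"))
// {
//     queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
//     var word = (string)queue.Receive().Body;
//     // process the word, then report the result back on a reply queue
// }
```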
You have the right idea. Divide and conquer. This is a typical job for distributed parallel computing. Let's say you have five machines, each with four cores, hyper-threaded. This gives you 40 logical processors.
As you have described, you have 900,000 words × 3 seconds = 750 hours of processing to do, plus a little overhead. If you can split that work across 40 processing threads, you can get it all done in less than 20 hours (750 / 40 ≈ 18.75). Splitting up the work is the easy part.
The hard part is distributing the work and executing it in parallel. You have some choices here, as others have pointed out. Let me add a few more for your consideration.
You could manually split the word list, by query or some other device, and launch a separate console application on each node/workstation that uses the TPL to max out every logical processor on that machine (a splitting sketch follows this list).
You could use something like MPAPI and code up your own nodes and workers.
You could install Windows Server on your nodes/workstations, run Microsoft HPC, and use something like MPI.NET to kick off the jobs.
You could write a console application and use DuoVia.MpiVisor to distribute and execute on your workstations. (Full disclosure: I am the author of MpiVisor)
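For the first option, one simple way to split the list is a modulo stripe per machine; machineIndex and machineCount are hypothetical values you'd pass on each node's command line:

```csharp
using System.Collections.Generic;
using System.Linq;

public static class WordSplitter
{
    // Stripe the list evenly: machine k takes words k, k + N, k + 2N, ...
    public static List<string> SliceFor(int machineIndex, int machineCount, IList<string> allWords)
    {
        return allWords
            .Where((word, i) => i % machineCount == machineIndex)
            .ToList();
    }
}
```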
Good luck to you.