Following is the scenario and some proposed solutions. Are there any better solutions?
There is a system A which has to "analyse" lots of URLs. Another system B generates these URLs - currently there are about 10 million of them in a database. Sample schema:
| id | URL     | has_extracted |
|----|---------|---------------|
| 1  | abc.com | 0             |
| 2  | bit.ly  | 1             |
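For concreteness, here is one way that sample schema could look in SQLite; the table and column names come from the sample above, and the Python sketches further down assume this layout:

```python
import sqlite3

conn = sqlite3.connect("urls.db")  # file name is an arbitrary choice
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS urls (
        id            INTEGER PRIMARY KEY,
        url           TEXT NOT NULL,
        has_extracted INTEGER NOT NULL DEFAULT 0
    )
    """
)
conn.commit()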
My solutions are as follows:
Naive solution: Have a Perl script/process which feeds URLs from the database to system A and updates the has_extracted column. The problem with this approach is that it does not scale well (a sketch of this loop appears after the solutions below).
Solution 2: Split up the database into five (or n) tables. (I am planning to remove the has_extracted column because it seems to be such a scalability bottleneck in this scenario.)
Solution 3: Remove the has_extracted column and create another table which maintains/tracks the last URL handed out to each process.
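For reference, a minimal sketch of the naive loop against the schema above (in Python rather than Perl, purely for illustration; `analyse()` is a hypothetical stand-in for whatever system A does). The per-row UPDATE and commit are exactly where it stops scaling:

```python
def analyse(url):
    """Hypothetical placeholder for system A's per-URL work."""

def naive_run(conn):
    rows = conn.execute(
        "SELECT id, url FROM urls WHERE has_extracted = 0"
    ).fetchall()
    for url_id, url in rows:
        analyse(url)
        # One UPDATE and one commit per URL: at 10 million rows,
        # this per-row round trip dominates the runtime.
        conn.execute("UPDATE urls SET has_extracted = 1 WHERE id = ?", (url_id,))
        conn.commit()
```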
Critiques/Proposed solutions requested. Thanks in advance.
Why doesn't your naive solution scale well? If you're using bulk updates and committing infrequently, you can update 1 million rows per second on any database, without any tuning.
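A sketch of what "bulk updates, commit infrequently" could look like on the same schema, reusing the hypothetical `analyse()` from the sketch above; the batch size is an arbitrary assumption to tune for your database:

```python
BATCH_SIZE = 10_000  # assumption; tune to your database

def batched_run(conn):
    rows = conn.execute(
        "SELECT id, url FROM urls WHERE has_extracted = 0"
    ).fetchall()
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        for _url_id, url in batch:
            analyse(url)
        # One bulk UPDATE and a single commit per batch, not per row.
        conn.executemany(
            "UPDATE urls SET has_extracted = 1 WHERE id = ?",
            [(url_id,) for url_id, _url in batch],
        )
        conn.commit()
```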
If you want to run multiple instances of system A, you can use a hash function to divide the input data into groups, where each instance of system A consumes exactly one group.
If you have a constant number of instances of system A, e.g. 17, you can use id % 17 as the hash function.
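A sketch of that partitioning on the schema above: each of the 17 worker processes is started with its own `worker_index` and only ever sees its own group of rows, so no two instances process the same URL.

```python
NUM_WORKERS = 17  # the constant number of instances from the example

def rows_for_worker(conn, worker_index):
    # Each instance consumes exactly the group where
    # id % NUM_WORKERS == worker_index.
    return conn.execute(
        "SELECT id, url FROM urls WHERE has_extracted = 0 AND id % ? = ?",
        (NUM_WORKERS, worker_index),
    )
```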
I think this can be structured as follows:
- URL generator (one instance or many)
- URL stack (one instance)
- URL processor (many instances)
The URL generator(s) produce URLs and push all of them onto the stack, which could live in a database, in memory, or wherever you want.
URL processors consult the URL stack to get the next URL to process. The URL stack hands them a URL and either marks it as given out or deletes it. When a URL processor finishes processing a URL, it consults the URL stack again, reports that it has finished URL1, and asks for URL2. The URL stack can then mark/delete URL1 from its list and hand out URL2.
If the URL stack becomes a bottleneck, you can simply cluster the database.
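A minimal sketch of that stack as a single table, assuming SQLite 3.35+ (for RETURNING) and a hypothetical `claimed_by` column added to the schema above; claiming is the "mark as given out" step, and deleting is the "finished processing" step:

```python
def claim_next_url(conn, worker_id):
    """Atomically hand one unclaimed URL to a processor; None when empty."""
    with conn:  # a single transaction, so two processors can't claim the same row
        row = conn.execute(
            "UPDATE urls SET claimed_by = ? "
            "WHERE id = (SELECT id FROM urls WHERE claimed_by IS NULL LIMIT 1) "
            "RETURNING id, url",
            (worker_id,),
        ).fetchone()
    return row  # (id, url) or None

def finish_url(conn, url_id):
    """Called by a processor when it is done; removes the URL from the stack."""
    with conn:
        conn.execute("DELETE FROM urls WHERE id = ?", (url_id,))
```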
I somehow feel my problem is similar to the one posted at this link (an extract is provided below). The solutions in the aforementioned link and in this link, "Databases suck for messaging", have pointed me toward a better implementation.
Extract: So you want to build a system that does jobs. You want the jobs to be able to be run in parallel for speed, but also for redundancy. This system needs to be coordinated so, for example, the same jobs aren't being done twice, the status of every job is easy to see, and multiple servers can run jobs by simply querying the central source.