I just accepted a similar question (PHP + MySQL Queue), but I realized that it wasn't the correct question for my problem, but was the correct ans开发者_高级运维wer for my question :)
I have a MySQL (MyISAM type) table of sites to be scraped by workers.
CREATE TABLE `site` (
`id` int(11) NOT NULL auto_increment,
`url` text,
`last_pop` int(13) default NULL,
`md5` varchar(32) default NULL,
`disabled` tinyint(1) default '0',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
What I need is to scrap one site per worker without repeat. So, if I have 3 sites and 2 workers the system needs to work like this:
ID URL LAST_POP
t4 1 site1 t1 <- worker1 scrap site1
t4 2 site2 t2 <- worker2 scrap site2
t5 3 site3 t3 <- worker1 scrap site3
t6 1 site2 t4 <- worker2 scrap site2
t6 2 site1 t4 <- worker1 scrap site1
t7 3 site3 t5 <- worker2 scrap site3
....
It's like a cycled queue orderer by last_pop ASC.
How can I do that?
You'll probably want to keep track of two pieces of information for each site: when it was last scraped and if it is currently being scraped.
Using the answer for your other question, set the scraping
field to the id of the worker to lock it from other workers. When the worker is finished with the task set the scraping
field back to null
and the last_scrape
date to the current time.
CREATE TABLE `site` (
`id` int(11) NOT NULL auto_increment,
`url` text,
`last_scrape` TIMESTAMP,
`scraping` tinyint(1) default NULL,
`md5` varchar(32) default NULL,
`disabled` tinyint(1) default '0',
PRIMARY KEY (`id`),
UNIQUE KEY `md5` (`md5`),
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Lock and retrieve the next job (the site that was last scraped the longest time ago):
Update site
set `scraping` = '$worker_id'
where `scraping` is null
order by `last_scrape` ASC limit 1;
$job =
Select * from site
where `scraping` = '$worker_id'
Release job back to queue:
Update site
set `scraping` = NULL,
`last_scrape` = NOW()
where `scraping` = '$worker_id';
Why not add an extra boolean column STATUS so that you can order by LAST_POP. So when a site is selected by a worker for scrapping run a second query with UPDATE site SET status = '1'
. And when the next worker selects the next site run a query with SELECT * FROM site WHERE status = '0' ORDER BY last_pop ASC
.
精彩评论