Because crawling the web can take a lot of time, I want to use pcntl_fork() to create multiple child processes and split my code into parts:
- Master - crawls the domain
- Child - when it receives a link, must crawl the link found on the domain
- Child - must do the same as 2. when it receives a new link
Can I create as many children as I want, or do I have to set a maximum number of child processes?
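To cap the number of children, the usual pattern is to fork up to a limit and pcntl_wait() for a child to exit before forking the next one. A minimal sketch (POSIX only; `MAX_CHILDREN`, `crawl_link()`, and the link list are illustrative assumptions, not part of the code below):

```php
<?php
// Bounded worker pool with pcntl_fork() (POSIX only, not Windows).
// MAX_CHILDREN, crawl_link() and $links are assumptions for illustration.
const MAX_CHILDREN = 4;

function crawl_link(string $url): void {
    // placeholder for the real per-link crawl
    usleep(10000);
}

$links = ['http://example.com/a', 'http://example.com/b', 'http://example.com/c'];
$running = 0;
$completed = 0;

if (function_exists('pcntl_fork')) {
    foreach ($links as $url) {
        // If the pool is full, reap one child before forking another
        if ($running >= MAX_CHILDREN) {
            pcntl_wait($status);
            $running--;
            $completed++;
        }
        $pid = pcntl_fork();
        if ($pid === -1) {
            die('fork failed');
        } elseif ($pid === 0) {
            crawl_link($url); // child does the work
            exit(0);
        }
        $running++; // parent bookkeeping
    }
    // Reap the remaining children
    while ($running > 0) {
        pcntl_wait($status);
        $running--;
        $completed++;
    }
}
```

Without the cap, each new link forks a new process and a big crawl can exhaust the machine, so some maximum is strongly advisable.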
Here's my code:
class MyCrawler extends PHPCrawler
{
    function handlePageData(&$page_data)
    {
        // Check the domain and keywords posted from the form
        $domain   = $_POST['domain'];
        $keywords = $_POST['keywords'];

        $str = file_get_contents($page_data["url"]);

        // strpos() returns 0 for a match at offset 0, so compare against false
        if (strpos($str, $keywords) !== false && $page_data["received"] == true)
        {
            echo "<table border='1' >";
            if ($page_data["header"]) {
                echo "<tr><td width='300'>Status:</td><td width='500'> ".strtok($page_data["header"], "\n")."</td></tr>";
            }
            // Print the requested page
            echo "<tr><td>Page requested:</td><td> ".$page_data["url"]."</td></tr>";
            // Print the referring page
            echo "<tr><td>Referer-page:</td><td> ".$page_data["referer_url"]."</td></tr>";
            // Was any content received?
            if ($page_data["received"] == true)
                echo "<tr><td>Content received: </td><td>".($page_data["bytes_received"] / 1024)." Kbytes</td></tr></table>";
            else
                echo "<tr><td>Content:</td><td> Not received</td></tr></table>";

            $link = mysql_connect('localhost', 'crawler', 'DRZOIDBERGGG');
            if (!$link)
            {
                die('Could not connect: ' . mysql_error());
            }
            mysql_select_db("crawler");

            if (empty($page_data["referer_url"]))
                $page_data["referer_url"] = $page_data["url"];

            // strip_tags() returns the stripped string; it does not modify in place
            $str = strip_tags($str, '<p><b>');

            $doc = new DOMDocument();
            @$doc->loadHTML($str); // suppress warnings from malformed HTML
            $xPath = new DOMXpath($doc);
            // Case-insensitive search for the keyword in every text node
            $xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '".strtoupper($keywords)."')]";
            $elements = $xPath->query($xPathQuery);

            foreach ($elements as $element) {
                print "Found: ".$element->nodeValue."<br />";

                // $element only exists inside this loop, so query per match,
                // and escape crawled text before putting it into SQL
                $data   = mysql_real_escape_string($element->nodeValue);
                $result = mysql_query("SELECT * FROM crawler WHERE data = '".$data."' ");
                if (mysql_num_rows($result) > 0)
                    echo 'Row already exists';
                else {
                    echo 'added';
                    mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '".mysql_real_escape_string($page_data["referer_url"])."', '".mysql_real_escape_string($page_data["url"])."', '".mysql_real_escape_string($keywords)."', '".$data."' )");
                }
            }

            echo "<br><br>";
            echo str_pad(" ", 5000); // "force flush" workaround for output buffering
            flush();
        }
    }
}
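Even with mysql_real_escape_string(), interpolating crawled text into SQL is fragile. A sketch of the same check-then-insert with prepared statements via PDO (an in-memory SQLite database is used here so the example is self-contained; for the real table you would swap the DSN for something like `mysql:host=localhost;dbname=crawler`):

```php
<?php
// Check-then-insert with PDO prepared statements.
// SQLite in-memory keeps the sketch self-contained; the table layout
// mirrors the crawler table from the question.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE crawler (id INTEGER PRIMARY KEY, domain TEXT, url TEXT, keywords TEXT, data TEXT)");

function save_match(PDO $db, string $domain, string $url, string $keywords, string $data): string {
    // Placeholders keep quotes and SQL metacharacters in $data harmless
    $check = $db->prepare("SELECT COUNT(*) FROM crawler WHERE data = ?");
    $check->execute([$data]);
    if ($check->fetchColumn() > 0) {
        return 'Row already exists';
    }
    $insert = $db->prepare("INSERT INTO crawler (domain, url, keywords, data) VALUES (?, ?, ?, ?)");
    $insert->execute([$domain, $url, $keywords, $data]);
    return 'added';
}

echo save_match($db, 'example.com', 'http://example.com/a', 'php', "text with 'quotes'"), "\n"; // added
echo save_match($db, 'example.com', 'http://example.com/a', 'php', "text with 'quotes'"), "\n"; // Row already exists
```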
FORGOT TO SAY: I need a Windows (x86, 32-bit) workaround!
pcntl is not supported on my client's machine.
I wonder if you wouldn't be better served by going with something like Gearman for this.
It's a job manager that runs on your system: you submit jobs to it (via PHP if you like), and it assigns them to workers (again, written in PHP), who then report back with their results. It's pretty robust and flexible in that you can run more workers to handle more workload.
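A rough sketch of that client/worker split (requires the PECL gearman extension and a running gearmand server; the `crawl_url` job name and `crawl_result()` helper are assumptions for illustration, and the pure job logic is kept in a plain function so it works without a server):

```php
<?php
// Gearman sketch: a worker registers a job handler, a client queues links.
// Needs ext-gearman and a gearmand server; guarded so it is a no-op without them.
// 'crawl_url' and crawl_result() are illustrative assumptions.

function crawl_result(string $url): string {
    // placeholder for the real crawl; here just report the URL length
    return $url . ':' . strlen($url);
}

if (class_exists('GearmanWorker') && ($argv[1] ?? '') === 'worker') {
    // Worker process: register the job and wait for work
    $worker = new GearmanWorker();
    $worker->addServer('127.0.0.1');
    $worker->addFunction('crawl_url', function (GearmanJob $job) {
        return crawl_result($job->workload());
    });
    while ($worker->work()) { /* keep serving jobs */ }
} elseif (class_exists('GearmanClient') && ($argv[1] ?? '') === 'client') {
    // Client process: queue a link without blocking the crawler
    $client = new GearmanClient();
    $client->addServer('127.0.0.1');
    $client->doBackground('crawl_url', 'http://example.com/page');
}
```

Since Gearman workers are ordinary processes talking to a server over TCP, this also sidesteps the pcntl/Windows problem entirely.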
shell_exec does the trick, but I don't know how to use it.
Look into this: http://in.php.net/manual/en/ref.pcntl.php#37369
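On Windows, where pcntl is unavailable, a common workaround is to launch separate worker processes with popen()/shell_exec(). A sketch (the `worker.php` script name is an assumption; on Windows `start /B` detaches the command, on POSIX `&` with output redirection does):

```php
<?php
// Launch a background process without pcntl.
// 'worker.php' and its argument below are illustrative assumptions.
function spawn_background(string $cmd): void {
    if (strtoupper(substr(PHP_OS, 0, 3)) === 'WIN') {
        // 'start /B' runs the command detached, without a new window
        pclose(popen('start /B ' . $cmd, 'r'));
    } else {
        // Redirect output and background the command so shell_exec returns at once
        shell_exec($cmd . ' > /dev/null 2>&1 &');
    }
}

// e.g. hand one link to a separate PHP worker process:
// spawn_background('php worker.php http://example.com/page');
spawn_background(PHP_BINARY . ' -r "exit(0);"');
echo "spawned\n";
```

The parent can keep crawling while each spawned process handles one link; results would have to come back through the database or files, since there is no shared memory between the processes.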