I built a program that loops through words and fetches their synonyms from www.dicsin.com.br, but it will take ages (literally), because there are 307k words in my testfile.txt. What can I do? Please give me some advice. Can I make it multi-process or multi-threaded? I don't know; I'm new to PHP and programming. Thank you anyway. By the way, this is my full working code:
<?
//Pega palavras do site: www.dicsin.com.br
pegarSinonimos("http://www.dicsin.com.br/content/dicsin_lista.php");
function pegaPalavras()
{
return file('testfile.txt');
}
function pegarSinonimos($url)
{
$dicionario = pegaPalavras();
$array_palavras = array();
$array_palavras2 = array();
$con = mysql_connect("localhost","root","whatever");
if (!$con)
{
die('Could not connect: ' . mysql_error());
}
mysql_select_db("palavras2", $con);
foreach($dicionario as $palavra)
{
$url_final = $url . "?f_pesq=" . $palavra;// . "&pagina=" . $pagina;
$html = file_get_contents($url_final);
$dom = new DOMDocument();
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="palavras_encontradas"]/div[@class="box_palavras_encontradas"]');
foreach ($tags as $tag)
{
$bla = $tag->nodeValue;
$bla = utf8_decode($bla);
$bla = str_replace("visualizar palavras", "", $bla);
$bla = str_replace("(Sinônimo) ", "", $bla);//echo $bla;//array_push($array_palavras,$tag->nodeValue);
$sql = "CREATE TABLE $palavra(sinonimo varchar(29))";
mysql_query($sql,$con);
mysql_query("INSERT INTO $palavra (sinonimo) VALUES ('$bla')");
}
}
mysql_close($con);
}
?>
Build a hash table and do your lookups against that. Each lookup then takes O(1) (constant) time on average.
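One possible reading of this advice, as a minimal sketch: PHP arrays are hash tables, so keying an array by word gives constant-time membership checks. That can be used to skip duplicate words, or words that were already stored by an earlier run. The function name pendingWords and the $alreadyDone list are hypothetical, not part of the original code.

```php
<?php
// Hypothetical helper: return the words that still need to be fetched.
// Array keys act as a hash set, so isset() is O(1) on average.
function pendingWords(array $words, array $alreadyDone)
{
    $done = array_flip($alreadyDone); // word => index
    $pending = array();
    foreach ($words as $word) {
        $word = trim($word); // file() keeps the trailing newline
        if ($word === '' || isset($done[$word])) {
            continue; // empty line, duplicate, or already processed
        }
        $done[$word] = true; // remember it so later duplicates are skipped
        $pending[] = $word;
    }
    return $pending;
}
```

This does not make the remote requests any faster; it only avoids making the same request twice.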
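If you want to run it in parallel, you can fork worker processes with the PCNTL functions (this gives you multiple processes rather than threads). PHP has to be compiled with --enable-pcntl for them to be available.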
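A rough sketch of the forking approach, assuming PHP CLI with the PCNTL extension enabled. The helper names splitIntoChunks and runInParallel are made up for illustration; $job stands for whatever function processes one chunk of words.

```php
<?php
// Hypothetical helper: split the word list into roughly one chunk per worker.
function splitIntoChunks(array $words, $workers)
{
    if (count($words) === 0) {
        return array();
    }
    $size = (int) ceil(count($words) / $workers);
    return array_chunk($words, $size);
}

// Hypothetical helper: fork one child per chunk and wait for all of them.
function runInParallel(array $chunks, $job)
{
    $pids = array();
    foreach ($chunks as $chunk) {
        $pid = pcntl_fork();
        if ($pid === -1) {
            die("Could not fork\n");
        }
        if ($pid === 0) {
            // Child process: do the work, then exit so it leaves the loop.
            call_user_func($job, $chunk);
            exit(0);
        }
        // Parent process: remember the child and keep forking.
        $pids[] = $pid;
    }
    // Parent process: block until every child has finished.
    foreach ($pids as $pid) {
        pcntl_waitpid($pid, $status);
    }
}
```

Each child should open its own MySQL connection inside $job, since a connection opened before the fork would be shared between processes. A call might look like runInParallel(splitIntoChunks(file('testfile.txt'), 8), 'processChunk'), where processChunk and the worker count 8 are placeholders.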
Complexity
Like FinalForm said, the complexity of your algorithm is too high (O(n^2)). You should avoid nesting a loop inside a loop (inside yet another loop). You should always work out the complexity of your algorithm (this can be difficult to do mathematically).
Low hanging fruit
To optimize the slow parts, tackle only your low-hanging fruit, which you can find with profiling tools like Xdebug/KCachegrind. I advise you to watch the talk "Simple is Hard" by PHP creator Rasmus Lerdorf to learn this concept. When you tackle the low-hanging fruit, you get the most bang for your buck.
Curl_multi
I think the really slow part is that you fetch the URLs one at a time, and each request blocks (the script can't do anything else in the meantime). I don't think the other loops take much time compared to fetching from a remote host, which I think is your low-hanging fruit. You could use curl_multi to fetch your URLs in parallel => http://www.onlineaspect.com/2009/01/26/how-to-use-curl_multi-without-blocking/. This should be much faster than blocking file_get_contents calls.
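As a minimal sketch of the curl_multi idea, assuming the cURL extension is installed: the function below fetches a batch of URLs concurrently and returns the bodies keyed by URL. The names buildUrl and fetchBatch, the 30 second timeout, and the use of urlencode on the word are assumptions for illustration, not something taken from the linked article.

```php
<?php
// Hypothetical helper: build the lookup URL for one word.
// trim() removes the newline that file() leaves on each line.
function buildUrl($base, $word)
{
    return $base . '?f_pesq=' . urlencode(trim($word));
}

// Hypothetical helper: fetch a batch of URLs concurrently.
// Returns array(url => response body).
function fetchBatch(array $urls)
{
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return body as string
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed timeout
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity instead of spinning
        }
    } while ($active && $status == CURLM_OK);

    $results = array();
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

With 307k words it is probably wiser to feed this in modest batches (for example via array_chunk) than to open every connection at once, both for your own machine and for the remote site.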
Message Queue (MQ)
This is often not available (or not welcome) on shared hosting, but to make your site really fast you should process the load offline using a message queue, for example Redis or beanstalkd. Each task is then handled separately by an offline worker, which reports its results back.
If you use the latest NaturePHP library for PHP 5.3+ and have cURL installed, this should give you a huge boost:
<?php
include('nphp/init.php');
function pegaPalavras()
{
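//drop the trailing newline on each word and skip empty lines
return file('testfile.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);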
}
//Pega palavras do site: www.dicsin.com.br
pegarSinonimos("http://www.dicsin.com.br/content/dicsin_lista.php");
function pegarSinonimos($url)
{
$dictionary = pegaPalavras();
$files = array();
$con = mysql_connect("localhost","root","blablabla"); //omg! a root pwd :O
if (!$con)
{
die('Could not connect: ' . mysql_error());
}
mysql_select_db("palavras2", $con);
foreach($dictionary as $palavra)
{
$files[] = $url . "?f_pesq=" . $palavra;// . "&pagina=" . $pagina;
}
//Http::multi_getcontents in NaturePHP uses curl_multi for parallel processing and
//fires callbacks asap, first come first served
//it will, however, use a lot of CPU and bandwidth while processing
Http::multi_getcontents(
//uris
$files,
//process callback
function($url, $content) use ($con){
//the word is the value of f_pesq in the request URL
list(, $word) = explode('=', $url, 2);
$word = trim($word);
$dom = new DOMDocument();
$dom->loadHTML($content);
$xpath = new DOMXPath($dom);
$tags = $xpath->query('//div[@class="palavras_encontradas"]/div[@class="box_palavras_encontradas"]');
//you create one table per word
$sql = "CREATE TABLE $word(sinonimo varchar(29))";
mysql_query($sql,$con);
//if something was found
if($tags->length>0){
//get an array with synonyms
$synonyms=array();
foreach ($tags as $tag)
{
$synonyms[] = utf8_decode($tag->nodeValue);
}
//you can use str_replace on arrays, it's faster
$synonyms = str_replace("visualizar palavras", "", $synonyms);
$synonyms = str_replace("(Sinônimo) ", "", $synonyms);
//a single insert query with all values is much faster
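//escape each value so quotes in a synonym cannot break the query
$synonyms = array_map('mysql_real_escape_string', $synonyms);
$values = "('" . implode("'), ('", $synonyms) . "')";
mysql_query("INSERT INTO $word (sinonimo) VALUES $values");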
}
});
mysql_close($con);
}
?>
I haven't actually tested this code, so there could be minor bugs, but you get the general idea ;)
If you don't have PHP 5.3+, you can look at the NaturePHP source code to see how to use curl_multi.
PS: you might want to change your root password :x