Having trouble with cURL and expiring links

I am working on a page for a library that will display the latest books, movies and items that the library has added to their collection.

A friend and I (both of us are new to PHP) have been trying to use cURL to accomplish this. We have gotten the code to grab the sections we want and have it formatted as it should look on the results page.

The problem we are having is that the URL we feed into cURL is generated automatically somehow and expires every few hours, which breaks the page.

Here is the PHP we are using:

<?php    
//function storeLink($url,$gathered_from) {
//   $query = "INSERT INTO links (url, gathered_from) VALUES ('$url', '$gathered_from')";
//    mysql_query($query) or die('Error, insert query failed');
//}



// request the search page so we can discover the current 'Hot New Items' link
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, "http://catalog.yourppl.org/limitedsearch.asp"); 
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$refreshlink= curl_exec($ch);


// Pull the 'Hot New Items' URL out of the search page: find the label,
// then scan back for the 'http' that starts the link's href
$endlink = strpos($refreshlink, 'Hot New Items') - 2;      // end of the URL
$startlink = $endlink - 249;
$startlink = strpos($refreshlink, 'http', $startlink);     // start of the URL
$endlink = $endlink - $startlink;                          // length of the URL
$linkurl = substr($refreshlink, $startlink, $endlink);
//echo $linkurl;

//this is the link that expires (hardcoding it here overwrites the link scraped above)
$linkurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-168&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";


$useragent = "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1";

curl_setopt($ch, CURLOPT_USERAGENT, $useragent);
curl_setopt($ch, CURLOPT_URL, $linkurl); 
curl_setopt($ch, CURLOPT_FAILONERROR, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 50);
$html= curl_exec($ch);
if (!$html) {
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    exit;
}

$content = $html;

$PHolder = 0;   // position of the previous 'isbn' hit; the next search starts after it
$x = 0;
$y = 0;
$max = strlen($content);
$isbn = array(300 => 0);        // parsed ISBNs
$stitle = array(300 => 0);      // linked titles
$sbookcover = array(300 => 0);  // cover-image tags


while ($x < 200)
{
    $x++;

    // Find the next 'isbn' marker in the page
    $start = strpos($content, 'isbn', $PHolder + 5);

    // Scan back from the following 'Branch=,0,"' to find the <a href
    // of the title link that belongs to this ISBN
    $start2 = strpos($content, 'Branch=,0,"', $start + 5);
    $start2 = $start2 - 400;
    if ($start2 < 0) break;

    $start2 = strpos($content, '<a href', $start2);
    if ($start2 === false) break;

    $start2 = $start2 - 12;

    $end2 = strpos($content, '</a>', $start);   // end of the title link
    $end = strpos($content, '"', $start);
    $offset = 13;                               // take 13 characters of ISBN
    $offset2 = $end2 - $start2;

    // Record this ISBN only if we haven't collected it already
    $newisbn = substr($content, $start + 5, $offset);
    if (array_search($newisbn, $isbn) === false)
    {
        $y++;
        $isbn[$y] = $newisbn;

        // Cover image served by the TLC content server, keyed by ISBN
        $sbookcover[$y] = "
            <img border=\"0\" width=\"170\" alt=\"Book Jacket\" src=\"http://ls2content.tlcdelivers.com/content.html?customerid=7977&amp;requesttype=bookjacket-lg&amp;isbn=$isbn[$y]&amp;isbn=$isbn[$y]\">
            ";

        $stitle[$y] = substr($content, $start2 + 12, $offset2);

        // Make the catalog link absolute and have it open in Shadowbox
        $title = $stitle[$y] . "</a>";
        $stitle[$y] = str_replace("<a href=\"", "<a href=\"http://catalog.yourppl.org", $title);
        $stitle[$y] = str_replace("\">", "\" rel=\"shadowbox\">", $stitle[$y]);

        // Wrap the cover image in the same link as the title
        $booklinkend = strpos($stitle[$y], "\">");
        $booklink = substr($stitle[$y], 0, $booklinkend + 2);
        $sbookcover[$y] = $booklink . $sbookcover[$y] . "</a>";

    }


    // Remember this hit so the next pass searches past it
    $PHolder = $start;
}



echo"

<table class=\"twocolorformat\" width=\"95%\">



";

$xx = 1;
$xy = 0; // row counter; the loop below emits two table rows per pass
while ($xy <= 6)
{
$xy++;

echo "

<tr>
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</div></td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</div></td>
";
$xx++;
echo"
<td width=\"33%\" align=\"center\"><div class=\"bookcover\">$sbookcover[$xx]</div></td>
";
$xx = $xx -2;

echo"
</tr>
<tr>
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx++;
echo"
<td width=\"33%\">$stitle[$xx]</td>
";
$xx = $xx -2;
echo"
</tr>

";//this is the table row and table data definition. covers and titles are fed to table here.



$xx = $xx +3;
if (empty($sbookcover[$xx])) break; // stop once we run out of covers
}


echo"

</table>

";//close your table here


?>

The page that has the link is here:

http://www.catalog.portsmouth.lib.oh.us/limitedsearch.asp

  • We are looking to grab the books and cover images from 'Hot New Items' on that page and work on the rest after we get it working.

If you click the Hot New Items link, the initial url is:

http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true

but once the page loads, changes to:

http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=-178&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true

Is there anything we can do to get around the expiring links? I can provide more code and explanation if needed.

Thanks very much to anyone who can offer help, Terry


Is there anything we can do to get around the expiring links?

You're interfacing with a system that wasn't designed to be (ab)used the way you're using it. Like many search systems, it looks like they build the results and store them somewhere, and, also like many search systems, those results become invalid after a period of time.

You're going to have to design your code under the assumption that the search results are going to poof into the ether very quickly.

It looks like there's a parameter in the URL, ItemsPerPage, that dictates how many results come back per page. Try changing it to a higher number -- a much higher number. They don't seem to have placed a bounds check on it at the code level: I was able to enter 1000 without it complaining, though it only returned 341 links.
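For example, a minimal sketch that reuses the $linkurl already captured in the question's script (the 1000 figure is simply the value I tried above):

// Ask for far more results per page so a single request covers the
// whole list before the link expires; 1000 returned 341 links here
$linkurl = str_replace('ItemsPerPage=10', 'ItemsPerPage=1000', $linkurl);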

Keep in mind that this is very likely going to cause some pretty noticeable load on their machine, so be careful and gentle when making your requests. You don't want to draw attention to yourself by making it look like you're attacking their service.
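If the script ever has to make several requests in a row, even a simple pause between them keeps the load unobtrusive; the delay value here is arbitrary, not something the catalog specifies:

// Wait a couple of seconds between consecutive cURL requests so the
// catalog server never sees a burst of traffic from this script
sleep(2);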


The page returned by the original link generates the results, then sends you a page whose JavaScript inserts the values into a URL and forwards you to it; that URL fetches the stored results page. The server identifies each results page by a LimitsId (you can see it in the URL of the results page). They presumably use this number to control how long a page lasts, and each request generates a new LimitsId, since not every ID works for the results page.

The point of all this: you can use cURL to fetch the first page (the link off of the original page, which makes the server generate and store the results), search the response for the text 'LimitsId=-' (for some reason they all have a dash in front of them, though I'm not sure they're actually negative, as the numbers go up), and paste that value into the same spot in the URL your script already uses. That will get you to the newly generated results.
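A minimal sketch of that flow, assuming the regular expression below matches what the server actually sends back and that the rest of the results URL can stay exactly as it appears in the question:

// Step 1: request the 'Hot New Items' link from limitedsearch.asp so the
// server generates and stores a fresh result set
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?Limits&LimitsId=0&FormId=0&StartIndex=0&Config=pac&ReturnForm=22&Branch=,0,&periodlimit=30&LimitCollection=1&Collection=Adult%20New%20Book&autosubmit=true");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
$page = curl_exec($ch);

// Step 2: pull the freshly generated LimitsId out of the response
// (they all seem to carry a leading dash, so the pattern keeps it)
if (preg_match('/LimitsId=(-\d+)/', $page, $m)) {
    // Step 3: splice the new LimitsId into the results URL and fetch it
    $resultsurl = "http://www.catalog.portsmouth.lib.oh.us/TLCScripts/interpac.dll?NewestSearch&Config=pac&FormId=0&LimitsId=" . $m[1] . "&StartIndex=0&SearchField=119&Searchtype=1&SearchAvailableOnly=0&Branch=,0,&PeriodLimit=30&ItemsPerPage=10&SearchData=&autohide=true";
    curl_setopt($ch, CURLOPT_URL, $resultsurl);
    $html = curl_exec($ch);
}
curl_close($ch);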

However, as Charles pointed out, these requests will put a significant load on the server, so it may be better to make a new request only when the old one has expired.
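One way to implement that refresh-on-expiry idea is a simple file cache. A minimal sketch, where the cache path, the one-hour lifetime, and the fetch_hot_new_items() helper (standing in for the cURL code above) are all assumptions for illustration:

$cachefile = __DIR__ . '/newbooks.cache.html'; // hypothetical cache path
$cachettl  = 3600;                             // hypothetical one-hour lifetime

if (is_file($cachefile) && (time() - filemtime($cachefile)) < $cachettl) {
    // Cached copy is still fresh: serve it without touching the catalog
    $html = file_get_contents($cachefile);
} else {
    // Cache is stale or missing: fetch a new LimitsId and result page,
    // then store whatever came back for the next visitor
    $html = fetch_hot_new_items();             // hypothetical helper wrapping the cURL fetch above
    file_put_contents($cachefile, $html);
}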

