PHP Page Scraping - cURL Redirect Problem

Developer https://www.devze.com 2023-02-01 01:28 Source: Internet
I'm trying to scrape this link: https://www.bu.edu/link/bin/uiscgi_studentlink/1293403322?College=SMG&Dept=AC&Course=222&Section=C1&Subject=ACCT &MtgDay=&MtgTime=&ModuleName=univschr.pl&KeySem=20114&ViewSem=Spring+2011&SearchOptionCd=C&SearchOptionDesc=Class+Subject&MainCampusInd=. (It works fine if you access it in the browser.)

So I fetch it with cURL, using this code:

function curl_classes($url){
  $ch = curl_init();
  $userAgent = 'Googlebot/2.1 (http://www.googlebot.com/bot.html)';
  curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);
  // Read cookies from, and write them back to, the same file
  curl_setopt($ch, CURLOPT_COOKIEJAR, "cookie.txt");
  curl_setopt($ch, CURLOPT_COOKIEFILE, "cookie.txt");
  // Certificate checks are switched off here
  curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
  curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
  echo "NOW IM REALLY GOING TO: " . $url;
  curl_setopt($ch, CURLOPT_URL, $url);
  curl_setopt($ch, CURLOPT_FAILONERROR, true);
  curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);
  curl_setopt($ch, CURLOPT_AUTOREFERER, true);
  curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($ch, CURLOPT_TIMEOUT, 50);
  // Note: this second CURLOPT_USERAGENT call overrides the Googlebot string set above
  curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);

  $html = curl_exec($ch);
  if (!$html) {
    // Read the error details before closing the handle
    echo "<br />cURL error number:" . curl_errno($ch);
    echo "<br />cURL error:" . curl_error($ch);
    curl_close($ch);
    exit;
  }
  curl_close($ch);
  echo htmlspecialchars($html);
}

EDIT

Okay, new problem. My cookie storing code doesn't seem to be working. I'm able to scrape this link as desired: bu[DOT]edu/link/bin/uiscgi_studentlink/1293357973?ModuleName=univschr.pl&SearchOptionDesc=Class+Subject&SearchOptionCd=C&KeySem=20114&ViewSem=Spring+2011&Subject=ACCT&MtgDay=&MtgTime=

But when I try to scrape the link at the top of this post I get: "Sorry you need cookies enabled..."

What am I doing wrong in my cookie storing code?


I'm betting that you do access the HTML. It prints the HTML to the screen, and that HTML includes code that redirects you to a new page.

Try outputting an encoded version of the HTML, so that the browser interprets it as plain text:

echo htmlspecialchars($html);
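
To illustrate why escaping helps, here is a minimal sketch. It assumes the fetched page redirects via a meta refresh or a script; the $html string below is made up for the example and is not the real response from the site.

```php
<?php
// Hypothetical response body; the real page's markup is not shown in the question.
$html = '<html><head><meta http-equiv="refresh" content="0;url=/login">'
      . '<script>window.location = "/login";</script></head></html>';

// Echoing $html directly would let the browser act on the redirect.
// Escaping turns the markup into inert text that the browser only displays.
$escaped = htmlspecialchars($html, ENT_QUOTES, 'UTF-8');

echo '<pre>' . $escaped . '</pre>';
```

Viewed in a browser, this should show the tags as literal text rather than navigating away.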

However, looking at your actual code: please do not pretend to be Google. You are not the Googlebot, so your script should not say that you are. If you include any user agent at all (and I recommend that you do), make it reflect your identity, in case the site owner hits issues with your bot. No need to be shady :)
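
One possible way to build an honest user agent is sketched below. The helper name build_user_agent, the bot name, and the example.com URL are all placeholders invented for illustration; the "name/version (+info URL)" shape is just a common convention, not something the site requires.

```php
<?php
// Hypothetical helper: builds a User-Agent that names the bot and gives a contact URL.
function build_user_agent(string $name, string $version, string $infoUrl): string
{
    return sprintf('%s/%s (+%s)', $name, $version, $infoUrl);
}

// Placeholder values; substitute your own project name and contact page.
$userAgent = build_user_agent('ClassScheduleScraper', '0.1', 'https://example.com/bot-info');

echo $userAgent . "\n";
```

The resulting string would then be passed to curl_setopt($ch, CURLOPT_USERAGENT, $userAgent) in place of the Googlebot string.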


Since you're echoing the contents out in the browser, any JavaScript in the remote page will be executed. Presumably something in it is redirecting the page.


You can write the HTML to a file and then open that in an editor if the page has annoying JavaScript. Or just disable JS in your browser.
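
A rough sketch of the write-to-file approach follows. The $html value stands in for whatever curl_exec() returned, and the temp-file location is an arbitrary choice for the example.

```php
<?php
// Stand-in for the string returned by curl_exec(); made up for illustration.
$html = '<html><body><script>window.location = "/elsewhere";</script></body></html>';

// Write the raw response to a temp file instead of echoing it to the browser.
$path = tempnam(sys_get_temp_dir(), 'scrape_');
if (file_put_contents($path, $html) === false) {
    exit("Could not write to $path\n");
}

echo "Saved " . strlen($html) . " bytes to $path\n";
```

Opening the saved file in a text editor shows the markup exactly as received, with no script running.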
