I am trying to extract some data from a private forum. I created a PHP script that uses cURL to log in and DOMDocument to extract page data.
The script logs in successfully, but as soon as I try to load a web page using loadHTMLFile(), the forum acts as if I never logged in.
Someone told me that I may need to send cookie headers, but I have no idea how to do that or whether it's even necessary.
Does anyone have any ideas?
<?php
function vBulletinLogin($user, $pass)
{
$md5Pass = md5($pass);
$data = "do=login&url=index.php&vb_login_md5password=$md5Pass&vb_login_username=$user&cookieuser=1";
$ch = curl_init();
curl_setopt ($ch, CURLOPT_URL, "****"); // replace ** with tt
curl_setopt ($ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)");
curl_setopt($ch, CURLOPT_TIMEOUT, 10);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$data);
curl_setopt($ch, CURLOPT_COOKIEJAR, "/public_html/phpcrawl/cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEFILE, "/public_html/phpcrawl/cookies.txt");
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_REFERER, "****");
$store = curl_exec ($ch);
echo $store; // this shows that I have successfully logged in; it gives me a welcome message
print_r($_COOKIE);
curl_close($ch);
$pos = strpos($store, "Thank you for logging in");
if ($pos === false) return 0;
else return 1;
}
if(vBulletinLogin("****","****")) echo "Logged In";
else echo "Failed to Login check User / Pass";
$url="http://texturl.com";
echo $url."<br>";
//get new HTML document
$html = new DOMDocument();
$html->loadHTMLFile($url);
print $html->saveHTML(); // shows a login and password box saying I am not logged in
I believe you have to keep using cURL to fetch your HTML pages after logging in. The first cURL request logs you in and saves the session cookie to its cookie jar. The next time you use cURL with the same cookie jar, it sends that cookie and the server knows you are logged in. DOMDocument's loadHTMLFile() makes its own request and, as far as I know, does not use cURL's cookie jar, so the server treats it as a logged-out visitor.
You'll need to use cURL to fetch the HTML, and then you can pass that HTML string to a DOMDocument (for example with loadHTML()) and parse it.
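Here is a minimal sketch of that approach, not a tested drop-in for your forum. The helper names fetchWithCookies() and htmlToDom() are hypothetical, and it assumes the login request has already written its cookies to the same cookie file used in the question.

```php
<?php
// Hypothetical helper: fetch a page with cURL, reusing the cookie jar
// that the login request wrote.
function fetchWithCookies(string $url, string $cookieFile): string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile); // send the cookies saved at login
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);  // write back any updated cookies
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);    // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException("cURL request failed: " . curl_error($ch));
    }
    return $body; // the handle is released when $ch goes out of scope
}

// Hypothetical helper: parse an HTML string (not a URL) into a DOMDocument.
function htmlToDom(string $html): DOMDocument
{
    $dom = new DOMDocument();
    // Real-world forum markup is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $dom;
}

// Usage, after vBulletinLogin() has succeeded:
// $html = htmlToDom(fetchWithCookies($url, "/public_html/phpcrawl/cookies.txt"));
// print $html->saveHTML();
```

The point is that every request to the forum goes through cURL with the same cookie file, and DOMDocument only ever sees a string that has already been downloaded.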