When I use cURL to fetch a page on an e-commerce site, it always returns the same front page, ignoring the starting-item parameter, whereas the same URL works as expected in a browser.
Simplified code:
// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';
$ch = curl_init($url);
$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
. 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
//$cookieFile = tempnam('/tmp', 'curlcookie');
$cookieFile = dirname(__FILE__) . DIRECTORY_SEPARATOR . 'curlcookies.txt';
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 120,
CURLOPT_TIMEOUT => 120,
CURLOPT_MAXREDIRS => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => 1,
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
);
curl_setopt_array($ch, $options);
$strPageHTML = curl_exec($ch);
curl_close($ch);
Apologies for the Chinese-language site, but if you look at the items listed in the HTML returned by cURL and at their URLs, their IDs are always the same as those on the front page (where s = 0), when they should be different.
What am I doing wrong?
Edit 1: added cookie handling to the code; it still doesn't work.
Edit 2: edited the cookie line to clear up any confusion. The contents of the cookie file are as follows:
# Netscape HTTP Cookie File
# http://curl.haxx.se/rfc/cookie_spec.html
# This file was generated by libcurl! Edit at your own risk.
#HttpOnly_.taobao.com TRUE / FALSE 0 cookie2 d686d4be95b4b56b61292118b43e1333
#HttpOnly_.taobao.com TRUE / FALSE 1316321978 _tb_token_ eeab7e3e5ea9e
.taobao.com TRUE / FALSE 1321505978 t 3c473872e51e93b0cf172375b31f503a
.taobao.com TRUE / FALSE 0 uc1 cookie14=UoLdHCGrCsSKAg%3D%3D
.taobao.com TRUE / FALSE 0 v 0
.taobao.com TRUE / FALSE 0 _lang zh_CN:GBK
You should take a look at the cookies generated by the website, and at any CSRF tokens it may insert to discourage scraping. When I inspect the page on first load, I see this:
Set-Cookie:cookie2=b1d92ddac8aa82350a6ff5e892a8637d;Domain=.taobao.com;Path=/;HttpOnly
_tb_token_=fde3979ee6b13;Domain=.taobao.com;Path=/;Expires=Sat, 17-Sep-2011 07:09:40 GMT;HttpOnly
t=91f29eb410a21a04bf36025823c4b2ad; Domain=.taobao.com; Expires=Wed, 16-Nov-2011 07:09:40 GMT; Path=/
uc1=cookie14=UoLdHCDBHbn1eg%3D%3D; Domain=.taobao.com; Path=/
Maybe these cookies are used to identify you while you navigate through the categories.
Searching for "token" in the DOM also turned up some results.
Instead of accessing the page by pretending to be a user, would it be possible to get the information you need through their API (http://open.taobao.com/)?
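To compare what cURL actually stored with the Set-Cookie headers the browser receives, the cookie jar can be read back directly. The following is only a minimal sketch, assuming the jar is in the tab-separated Netscape format that libcurl writes; `parseCookieJar` is a made-up helper name, not part of PHP or cURL. Note that lines starting with `#HttpOnly_` are real cookies, not comments.

```php
<?php
/**
 * Hypothetical helper: parse the text of a Netscape-format cookie jar
 * (the format libcurl writes) into a list of associative arrays.
 */
function parseCookieJar(string $contents): array
{
    $cookies = [];
    foreach (preg_split('/\r\n|\r|\n/', $contents) as $line) {
        $httpOnly = false;
        if (strpos($line, '#HttpOnly_') === 0) {
            // libcurl marks HttpOnly cookies with this prefix; strip it and keep the cookie.
            $httpOnly = true;
            $line = substr($line, strlen('#HttpOnly_'));
        } elseif ($line === '' || $line[0] === '#') {
            continue; // blank line or genuine comment
        }

        // Seven tab-separated fields per cookie line.
        $fields = explode("\t", $line);
        if (count($fields) !== 7) {
            continue; // not a well-formed cookie line
        }
        [$domain, $sub, $path, $secure, $expires, $name, $value] = $fields;

        $cookies[] = [
            'domain'            => $domain,
            'includeSubdomains' => $sub === 'TRUE',
            'path'              => $path,
            'secure'            => $secure === 'TRUE',
            'expires'           => (int) $expires, // 0 means a session cookie
            'name'              => $name,
            'value'             => $value,
            'httpOnly'          => $httpOnly,
        ];
    }
    return $cookies;
}
```

Calling `parseCookieJar(file_get_contents($cookieFile))` after a request and comparing the cookie names with those in the browser's inspector should show whether cURL is missing any cookie the site sets.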
This page uses a lot of cookies; I would not be surprised if a session cookie is required to load the page. See what happens when you enable cookie handling:
curl_setopt($DATA_POST, CURLOPT_COOKIEFILE, 'cookiefile.txt');
curl_setopt($DATA_POST, CURLOPT_COOKIEJAR, 'cookiefile.txt');
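If a session cookie really is required, the very first request made with an empty jar has nothing to send back. One thing to try is to request a landing page first and then the paged URL on the same handle. This is a rough sketch under that assumption; it is not confirmed that the site behaves this way. `sessionCurlOptions` and `fetchWithSession` are hypothetical helper names.

```php
<?php
/**
 * Options shared by every request in the session.
 * Reading and writing the same jar lets later requests send back
 * whatever cookies earlier responses set.
 */
function sessionCurlOptions(string $cookieFile): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_ENCODING       => '', // empty string: offer every encoding this cURL build supports
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ];
}

/**
 * Hypothetical helper: request $landingUrl first so the server can set its
 * cookies, then request $targetUrl on the same handle.
 */
function fetchWithSession(string $landingUrl, string $targetUrl, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, sessionCurlOptions($cookieFile));

    // First request: made only to collect cookies.
    curl_setopt($ch, CURLOPT_URL, $landingUrl);
    if (curl_exec($ch) === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("Landing request failed: $error");
    }

    // Second request: the handle now holds the cookies from the first one.
    curl_setopt($ch, CURLOPT_URL, $targetUrl);
    curl_setopt($ch, CURLOPT_REFERER, $landingUrl);
    $html = curl_exec($ch);
    if ($html === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("Target request failed: $error");
    }

    curl_close($ch);
    return $html;
}
```

For the URL in the question, the landing page could be the same address without the `yp4p_page` and `s` parameters (an assumption), with the full paged URL as the target. If the second response still shows the front-page items, cookies alone are probably not the cause.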
// s is the starting item count, no idea what yp4p_page is for exactly yet.
$url = 'http://list.taobao.com/market/baobao.htm?cat=40&yp4p_page=4&s=176';
$ch = curl_init($url);
$header[0] = 'Accept: text/xml,application/xml,application/xhtml+xml,'
. 'text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5';
$header[] = 'Cache-Control: max-age=0';
$header[] = 'Connection: keep-alive';
$header[] = 'Keep-Alive: 300';
$header[] = 'Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7';
$header[] = 'Accept-Language: en-us,en;q=0.5';
$cookieFile = "cookie_china"; // I've changed this value and it seems to be working fine, I get the same results
$options = array(
CURLOPT_RETURNTRANSFER => true,
CURLOPT_HEADER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_ENCODING => 'gzip,deflate',
CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0 FirePHP/0.6',
CURLOPT_AUTOREFERER => true,
CURLOPT_CONNECTTIMEOUT => 120,
CURLOPT_TIMEOUT => 120,
CURLOPT_MAXREDIRS => 10,
CURLOPT_SSL_VERIFYHOST => 0,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_VERBOSE => 1,
CURLOPT_HTTPHEADER => $header,
CURLOPT_COOKIEFILE => $cookieFile,
CURLOPT_COOKIEJAR => $cookieFile,
);
curl_setopt_array($ch, $options);
$strPageHTML = curl_exec($ch);
curl_close($ch);