I am trying to get some data fields from a table in an HTML web page. The page is generated dynamically in response to a POST request. I am using php-curl to fetch the page and then XPath to extract the data from some fields. I am able to get the page, but not the specific fields. The code looks like this:
$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);
$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
$page=new DOMDocument();
$xpath=new DOMXPath($page);
$page->loadHTML($content);
$page->saveHTML(); // this shows the page contents
$total=$xpath->query($totalPath);
echo $total->length; //shows 0
echo $total->item(0)->nodeValue; //shows nothing
The XPath is correct, as I have checked it with FirePath. What I understand from this is that $xpath->query is not doing its job.
You write:
echo $total->length; //shows 0
That means the XPath query matched 0 elements, so the expression does not select what you want. Try it without the tbody step:
//html/body/table[4]/tr[3]/td[4]
Otherwise, check the syntax of your XPath query to make sure it contains no mistake.
Additionally, I would load the HTML document first and only then initialize the XPath object:
$totalPath="//html/body/table[4]/tr[3]/td[4]";
$page=new DOMDocument();
$page->loadHTML($content);
$xpath=new DOMXPath($page);
$total=$xpath->query($totalPath);
Edit: Removed tbody as suggested by Wrikken.
EDIT: Enable error reporting, including warnings, so you can verify that a) the HTML is properly loaded into the DOMDocument and b) any problem with the XPath expression becomes visible.
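A minimal sketch of that kind of checking, assuming a small inline HTML string in place of the real page (the sample markup and the variable names are made up for illustration). It turns on full error reporting, collects libxml parse messages, and tests the return value of query(), which is false for a malformed expression:

```php
<?php
// Show every notice and warning while debugging.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Collect libxml parse problems so they can be inspected explicitly.
libxml_use_internal_errors(true);

// Hypothetical sample markup; in the real script this would be the $content returned by cURL.
$html = '<html><body><table><tr><td>42</td></tr></table></body></html>';

$page = new DOMDocument();
$loaded = $page->loadHTML($html); // false if the document could not be loaded
foreach (libxml_get_errors() as $error) {
    echo 'libxml: ', trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

$xpath = new DOMXPath($page);
$result = $xpath->query('//table/tr/td');
if ($result === false) {
    echo "The XPath expression is malformed\n";
} else {
    echo 'Matched nodes: ', $result->length, "\n";
}
```

A length of 0 with no errors reported suggests the expression is valid but simply does not match the parsed document.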
Got it to run. This is my code:
<?php
$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);
echo 'Size: ', strlen($content), "\n";
echo 'Beginning: ', substr($content, 0, 512), "\n\n";
$page=new DOMDocument();
$page->recover=false;
$page->loadHTML($content);
echo "\nLoaded XML:\n", $page->saveXML($page), "\n";
$xpath=new DOMXPath($page);
$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
$paths = array(
'//body',
'//body/table',
'//body/table[4]',
'//body/table[4]/tr',
'//body/table[4]/tr[3]',
'//body/table[4]/tr[3]/td',
'//body/table[4]/tr[3]/td[4]',
'//html/body/table[4]/tr[3]/td[4]',
);
foreach($paths as $path) {
$result=$xpath->query($path);
echo $path, ': ', $result->length, "\n";
}
And this is the output (I cut the top part, which was only there to verify that the page loaded):
//body: 1
//body/table: 4
//body/table[4]: 1
//body/table[4]/tr: 3
//body/table[4]/tr[3]: 1
//body/table[4]/tr[3]/td: 4
//body/table[4]/tr[3]/td[4]: 1
//html/body/table[4]/tr[3]/td[4]: 1
Every query returns a non-zero length, meaning that at least one node matches each path.
Without looking at the HTML: the /tbody isn't there; it is just added in by Firefox. Remove that portion, and gain a healthy distrust of that tool ;)
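A minimal sketch of the tbody effect, assuming a small inline HTML string in place of the real results page (the markup and the variable names are made up for illustration). When the source has no tbody, DOMDocument::loadHTML() keeps the rows as direct children of table, so a path copied from the browser matches nothing:

```php
<?php
// Hypothetical stand-in for the fetched page: a table with no <tbody> in the source.
$html = '<html><body><table><tr><td>a</td><td>b</td></tr></table></body></html>';

$page = new DOMDocument();
$page->loadHTML($html);
$xpath = new DOMXPath($page);

// Path as a browser tool would report it (the browser inserts tbody into its own DOM).
$withTbody = $xpath->query('//table/tbody/tr/td')->length;
// Path matching the markup as it was actually sent.
$withoutTbody = $xpath->query('//table/tr/td')->length;

echo "with tbody: $withTbody, without tbody: $withoutTbody\n";
```

If a page might or might not contain tbody, an expression such as //table//tr/td sidesteps the question, at the cost of also matching rows in nested tables.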
edit:
And indeed the order should be:
$page=new DOMDocument();
$page->loadHTML($content);
$xpath=new DOMXPath($page);
DOMXPath is bound to the document as it was when the DOMXPath object was created, so it does not see content loaded with loadHTML() afterwards.
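A small sketch for comparing the two orders side by side, again assuming a made-up inline HTML string rather than the real page. Going by the behaviour described above, the DOMXPath created before loadHTML() should find nothing, while the one created afterwards finds the cell; the exact result of the first query may depend on the PHP version, so the script prints both counts and does not assume them:

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
$html = '<html><body><table><tr><td>42</td></tr></table></body></html>';

// Order from the question: the DOMXPath is created while the document is still empty.
$page = new DOMDocument();
$early = new DOMXPath($page);
$page->loadHTML($html);
$earlyResult = $early->query('//td');
$earlyCount = ($earlyResult === false) ? 0 : $earlyResult->length;

// Corrected order: create the DOMXPath after the HTML has been loaded.
$late = new DOMXPath($page);
$lateCount = $late->query('//td')->length;

echo "created before loadHTML: $earlyCount\n";
echo "created after loadHTML: $lateCount\n";
```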