Unable to get table data from an HTML page

开发者 https://www.devze.com 2023-03-11 18:39 Source: web
I am trying to extract some data fields from a table in an HTML page. The page is generated dynamically in response to a POST request. I am using php-curl to fetch the page and then XPath to get the data from specific fields. I can retrieve the page, but not the specific fields. The code looks like this:

$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);

$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";
$page=new DOMDocument();
$xpath=new DOMXPath($page);
$page->loadHTML($content);
$page->saveHTML();  // this shows the page contents

$total=$xpath->query($totalPath);
echo $total->length;    //shows 0
echo $total->item(0)->nodeValue;   //shows nothing

The XPath is correct as far as I can tell, since I checked it with FirePath. What I understand from this is that $xpath->query is not doing its job.


You write:

echo $total->length;    //shows 0

That means the XPath query matched 0 elements, so the expression does not select what you expect it to. Try this expression instead:

//html/body/table[4]/tr[3]/td[4]

Otherwise, check the syntax of your XPath query to make sure it does not contain a mistake.

Additionally, I would load the HTML document first and only then initialize the XPath object:

$totalPath="//html/body/table[4]/tr[3]/td[4]";
$page=new DOMDocument();
$page->loadHTML($content);
$xpath=new DOMXPath($page);    
$total=$xpath->query($totalPath);

Edit: Removed tbody as suggested by Wrikken.

EDIT: Enable error reporting, including warnings, so you can make sure that a) the HTML is loaded into DOMDocument properly and b) any problem with the XPath expression becomes visible.
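One way to follow that advice is sketched below. It turns on full error reporting and, as an alternative to reading raw warnings, collects the parser messages through libxml so they can be printed after loading. The $content string is a hypothetical stand-in for the page returned by curl_exec(), and the XPath expression is only an illustration, not the path from the question.

```php
<?php
// Report everything, including warnings raised by loadHTML() and query().
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical stand-in for the $content returned by curl_exec().
$content = '<html><body><table><tr><td>42</td></tr></table></body></html>';

// Collect libxml parse errors instead of emitting them as PHP warnings,
// so they can be inspected in one place after loading.
libxml_use_internal_errors(true);

$page = new DOMDocument();
$loaded = $page->loadHTML($content);

foreach (libxml_get_errors() as $error) {
    // Each entry is a LibXMLError carrying a message and a line number.
    echo 'libxml: ', trim($error->message), ' (line ', $error->line, ")\n";
}
libxml_clear_errors();

$xpath = new DOMXPath($page);
$result = $xpath->query('//table/tr/td');

// query() returns false when the expression itself is malformed.
if ($result === false) {
    echo "The XPath expression is malformed\n";
} else {
    echo 'Matched nodes: ', $result->length, "\n";
}
```

If the loop prints many messages for a real page, that usually just reflects sloppy markup; the point is to confirm that the document loaded at all before blaming the query.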


Got it to run. This is my code:

<?php

$url="http://www.rtu.ac.in/results/reformat.php";
$post="rollnumber=08epccs060&filename=fetchmodulesem_4_btech410m.php&button=Submit";
$ch=curl_init();
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST,1);
curl_setopt($ch,CURLOPT_POSTFIELDS,$post);
curl_setopt($ch,CURLOPT_FOLLOWLOCATION,1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
$content=curl_exec($ch);
curl_close($ch);

echo 'Size: ', strlen($content), "\n";
echo 'Beginning: ', substr($content, 0, 512), "\n\n";

$page=new DOMDocument();
$page->recover=false;
$page->loadHTML($content);

echo "\nLoaded XML:\n", $page->saveXML($page), "\n";


$xpath=new DOMXPath($page);
$totalPath="html/body/table[4]/tbody/tr[3]/td[4]";

$paths = array(
    '//body',
    '//body/table',
    '//body/table[4]',
    '//body/table[4]/tr',
    '//body/table[4]/tr[3]',
    '//body/table[4]/tr[3]/td',
    '//body/table[4]/tr[3]/td[4]',
    '//html/body/table[4]/tr[3]/td[4]',
);


foreach($paths as $path) {
    $result=$xpath->query($path);
    echo $path, ': ', $result->length, "\n";
}

And this is the output (I cut the top part, which was only there to verify that the page loaded):

//body: 1
//body/table: 4
//body/table[4]: 1
//body/table[4]/tr: 3
//body/table[4]/tr[3]: 1
//body/table[4]/tr[3]/td: 4
//body/table[4]/tr[3]/td[4]: 1
//html/body/table[4]/tr[3]/td[4]: 1

Every query returns a non-zero length, meaning that each path matches at least one node.


Without looking at the HTML: the /tbody isn't there, and is just added in by Firefox. Remove that portion, and gain a healthy distrust of that tool ;)
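The effect can be reproduced on a small document. The sketch below uses hypothetical markup (a table written without tbody, not the real results page) and compares a browser-style path with one that matches the source as served. The union expression in the last query is a suggestion of mine for cases where it is unknown whether tbody is present; it is not something the answers above propose.

```php
<?php
// Hypothetical markup: a table written without <tbody>, as many raw pages are.
$html = '<html><body><table>'
      . '<tr><td>a</td><td>b</td></tr>'
      . '<tr><td>c</td><td>d</td></tr>'
      . '</table></body></html>';

$page = new DOMDocument();
$page->loadHTML($html);
$xpath = new DOMXPath($page);

// Browser-style path: assumes a tbody element that the parsed source lacks.
$withTbody = $xpath->query('//table/tbody/tr');

// Path matching the source as served: rows are direct children of table.
$withoutTbody = $xpath->query('//table/tr');

// Union of both layouts, for when it is unclear which one applies.
$either = $xpath->query('//table/tr | //table/tbody/tr');

echo 'with tbody: ', $withTbody->length, "\n";
echo 'without tbody: ', $withoutTbody->length, "\n";
echo 'either: ', $either->length, "\n";
```

DOMDocument::loadHTML() keeps the table structure as written, so the first query should find no rows while the other two find both.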


edit:

And indeed the order should be:

$page=new DOMDocument();
$page->loadHTML($content);
$xpath=new DOMXPath($page);

As DOMXPath takes a snapshot of the document, it doesn't track DOM changes made afterwards.
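Putting the ordering together with a guarded read of the value, a minimal sketch might look like this. The markup, the path and the $total variable are illustrative assumptions; with the real page the path would be the tbody-free one discussed above.

```php
<?php
// Hypothetical stand-in for the fetched page.
$content = '<html><body><table>'
         . '<tr><td>Total</td><td>512</td></tr>'
         . '</table></body></html>';

$page = new DOMDocument();
$page->loadHTML($content);      // 1. parse the markup first
$xpath = new DOMXPath($page);   // 2. then bind the XPath object to the loaded document

$nodes = $xpath->query('//table/tr[1]/td[2]');

// Guard against a failed or empty result before dereferencing item(0).
$total = ($nodes !== false && $nodes->length > 0)
    ? trim($nodes->item(0)->nodeValue)
    : null;

echo $total === null ? "no match\n" : "total: $total\n";
```

Checking the length first avoids the "shows nothing" symptom from the question turning into an error when item(0) is null.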
