In PHP, how do you scrape a DOMDocument for a certain text pattern, then get the parent element of that matching text's text node?

Developer https://www.devze.com 2023-02-16 03:31 Source: web

I've built a simple web scraping utility with PHP and cURL, and have been using code like this to grab certain elements of the scraped page by ID, or by Tag Name where no ID is present on the desired element:

$dom = new DOMDocument();
@$dom->loadHTML($response);
$table = $dom->getElementsByTagName('table')->item(4);
$response = $dom->saveXML($table);

Now I've run into a dilemma where I need to go one step further and find the parent element of a certain string or regex pattern of text. The site from which I need to collect data doesn't use any IDs or classes in the HTML elements I need to extract data from, and various pages may have data organized in different ways, so I can't always rely on the data being in table #X. The only sure-fire way to get the data I'm after off this site is to look for it by its text format, which is always going to be a numeric list starting with "1. ". They don't use ordered lists either, or it would be much simpler. It's just a simple table cell with numeric lines separated by a simple <br>.

So I was thinking: if I could find the "1. ", its parent element would be the table cell <td>. After finding that cell, I would need to extract its content and perhaps the content of any adjacent table cells in the same table row. There are no other instances of "1. " that I could find in the page or the HTML code, so this approach seems reasonable, if a bit hacky, but I digress.

So, what's the best way to approach something like this?


You could always try an XPath query like the following (assuming the content you're after is always in a table cell):

$xpath = new DOMXPath($dom);
$cells = $xpath->query('//table/tr/td[contains(.,"1. ")]');
if ($cells->length > 0) {
    // get first item
    $cell = $cells->item(0);
    echo $cell->nodeValue; // text content only
    echo $dom->saveXML($cell); // <td>1. ... </td>
}
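
If you want the parent of the matching text node itself, as the question title asks, one possible variation is to walk the text nodes, test each against a regex, and return its parentNode. This is only a sketch under assumptions: the sample HTML below is made up to stand in for the scraped $response, the helper name findParentOfText is hypothetical, and the pattern assumes the list always begins with "1. " at the start of a text node.

```php
<?php
// Hypothetical sample markup standing in for the scraped $response.
$html = <<<HTML
<html><body>
<table><tr><td>Header</td></tr></table>
<table>
  <tr>
    <td>Label</td>
    <td>1. First item<br>2. Second item<br>3. Third item</td>
    <td>Notes</td>
  </tr>
</table>
</body></html>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

/**
 * Return the parent element of the first text node whose content
 * matches $pattern, or null when nothing matches.
 * (Hypothetical helper, not part of the DOM extension.)
 */
function findParentOfText(DOMXPath $xpath, string $pattern): ?DOMElement
{
    foreach ($xpath->query('//text()') as $textNode) {
        if (preg_match($pattern, $textNode->nodeValue)
            && $textNode->parentNode instanceof DOMElement) {
            return $textNode->parentNode;
        }
    }
    return null;
}

// Assumed pattern: optional leading whitespace, then "1." and a space.
$cell = findParentOfText($xpath, '/^\s*1\.\s/');

$lines = [];
$rowCells = [];
if ($cell !== null) {
    // The <br> elements split the cell into separate text nodes,
    // so each non-empty text child is one numbered line.
    foreach ($cell->childNodes as $child) {
        if ($child->nodeType === XML_TEXT_NODE && trim($child->nodeValue) !== '') {
            $lines[] = trim($child->nodeValue);
        }
    }

    // Text of every cell in the same row, via the nearest <tr> ancestor.
    foreach ($xpath->query('ancestor::tr[1]/td', $cell) as $td) {
        $rowCells[] = trim($td->textContent);
    }
}

print_r($lines);
print_r($rowCells);
```

Compared with contains(., "1. "), which tests the whole string value of a cell and so would also match an outer cell that wraps a nested table, this matches a single text node and anchors the pattern at its start. Whether that stricter match suits the real pages is something to check against the actual markup.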