Need help scraping webpage -- getting specific content...

Developer https://www.devze.com 2023-03-23 09:39 Source: web
I have a table whose number of columns can change depending on the configuration of the scraped page (I have no control over it). I want to get only the information from a specific column, designated by the column's heading.

Here is a simplified table:

<table>
<tbody>
<tr class='header'>
    <td>Image</td>
    <td>Name</td>
    <td>Time</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 1</td>
    <td>13:02</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 2</td>
    <td>13:43</td>
</tr>
<tr>
    <td><img src='someimage.png' /></td>
    <td>Name 3</td>
    <td>14:53</td>
</tr>
</tbody>
</table>

I want to only extract the names (column 2) of the table. However, as previously stated, the column order cannot be known. The Image column might not be there, for example, in which case the column I want would be the first one.

I was wondering if there's any way to do this with DOMDocument/DOMXPath. Perhaps search for the string "Name" in the first tr, find out which column index it is, and then use that index to get the info. A less elegant solution would be to check whether the first column has an img tag, in which case the image column is first, so we can throw that away and use the next one.

I've been looking at it for about an hour and a half, but I'm not familiar with the DOMDocument functions and manipulation. Having a lot of trouble with this one.


Simple HTML DOM Parser may be useful here; check its manual for details. Basically, you would use something like this:

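require_once 'simple_html_dom.php'; // Simple HTML DOM Parser library file

$url = "file url"; // placeholder: URL or path of the page to scrape
$html = file_get_html($url);
$header = $html->find('tr.header td'); // cells of the header row
$num = null; // stays null if no Image column is present
$i = 0;
foreach ($header as $element) {
    // Simple HTML DOM exposes the property as lowercase "innertext"
    if (trim($element->innertext) == 'Image') { $num = $i; }
    $i++;
}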

At this point $num holds the zero-based index of the Image column. The same loop can look for 'Name' instead, and that index then tells you which cell to read in each data row.
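Since the question asks about DOMDocument/DOMXPath specifically, the same "find the heading's index, then read that cell from every row" idea can be sketched with PHP's built-in DOM extension. This is only a sketch under assumptions: `extractColumn` is a hypothetical helper name, the header row is assumed to carry `class='header'` as in the simplified table, and `$sample` is a shortened copy of that table rather than a real page.

```php
<?php
// Hypothetical helper (not part of any library): returns the text of every
// cell in the column whose header cell reads $heading.
function extractColumn(string $htmlSource, string $heading): array
{
    // Collect parser warnings internally; scraped HTML is rarely well-formed.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($htmlSource);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // Walk the header row and remember the 1-based position of the heading
    // (XPath positions start at 1).
    $index = null;
    $position = 0;
    foreach ($xpath->query("//tr[@class='header']/td") as $cell) {
        $position++;
        if (trim($cell->textContent) === $heading) {
            $index = $position;
            break;
        }
    }
    if ($index === null) {
        return []; // heading not present on this page
    }

    // Read the cell at that position from every non-header row.
    $query = sprintf("//tr[not(@class='header')]/td[%d]", $index);
    $values = [];
    foreach ($xpath->query($query) as $cell) {
        $values[] = trim($cell->textContent);
    }
    return $values;
}

// Shortened copy of the table from the question (assumed input).
$sample = <<<HTML
<table>
<tr class='header'><td>Image</td><td>Name</td><td>Time</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 1</td><td>13:02</td></tr>
<tr><td><img src='someimage.png' /></td><td>Name 2</td><td>13:43</td></tr>
</table>
HTML;

$names = extractColumn($sample, 'Name');
print_r($names);
```

Because the index is looked up from the header row on every call, the helper should keep working when the Image column is missing and Name becomes the first column. An unknown heading yields an empty array instead of an error.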

PS: An easy way to collect all image sources:

$imageUrl = []; // collected src attributes
$images = $html->find('tr td img');
foreach ($images as $image) {
    $imageUrl[] = $image->src;
}