
php: Extract text between specific tags from a webpage [duplicate]

Developer https://www.devze.com 2023-04-11 21:21 Source: Internet
This question already has answers here: Closed 11 years ago.

Possible Duplicate:

Best methods to parse HTML with PHP

I understand I should be using an HTML parser like PHP's DOMDocument (http://docs.php.net/manual/en/domdocument.loadhtml.php) or TagSoup.

How would I use DOMDocument to extract the text between specific tags, for example the text inside h1, h2, h3, p and table elements? It seems I can only do this for one tag at a time with getElementsByTagName.

Is there a better HTML parser for such a task? Or how would I loop through the DOMDocument?


You are correct: use DOMDocument, since regex is NOT a good idea for parsing HTML. (Why? See here and here for the reasons.)

getElementsByTagName gives you a DOMNodeList that you can iterate over to get the text of all the found elements. So, your code could look something like:

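// $html is assumed to already hold the page source as a string.
$document = new \DOMDocument();
$document->loadHTML($html);

$tags = array('h1', 'h2', 'h3', 'h4', 'p');
$texts = array();
foreach ($tags as $tag)
{
  // One DOMNodeList per tag name
  $elementList = $document->getElementsByTagName($tag);
  foreach ($elementList as $element)
  {
     // Group the text of each element under its tag name
     $texts[$element->tagName][] = $element->textContent;
  }
}
// Assumes this code is the body of a function
return $texts;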

Note that you should probably add some error handling, and that grouping the texts by tag name loses their context (the order in which they appear in the document), but you can edit this code as you see fit.
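
If the order of the texts matters, one possible variation is to select all the wanted elements with a single XPath query, which returns them in document order. The sketch below is only an illustration: the function name extractTextsInOrder and the sample HTML are made up for this example, the tag names are assumed to be trusted (they are inserted into the query as-is), and libxml_use_internal_errors is used so that malformed markup does not emit PHP warnings.

```php
<?php
// Hypothetical helper: returns the text of the given tags in document order.
function extractTextsInOrder(string $html, array $tags): array
{
    // Collect libxml parse errors internally instead of raising PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $document = new DOMDocument();
    $loaded = $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new RuntimeException('Could not parse the HTML.');
    }

    // Build a predicate such as: self::h1 or self::h2 or self::p
    $conditions = array_map(function ($tag) {
        return 'self::' . $tag;
    }, $tags);

    $xpath = new DOMXPath($document);
    $nodes = $xpath->query('//*[' . implode(' or ', $conditions) . ']');

    $texts = [];
    foreach ($nodes as $node) {
        $texts[] = ['tag' => $node->nodeName, 'text' => trim($node->textContent)];
    }
    return $texts;
}

// Made-up sample input for illustration.
$sample = '<html><body><h1>Title</h1><p>Intro</p><h2>Part</h2><p>More</p></body></html>';
foreach (extractTextsInOrder($sample, ['h1', 'h2', 'h3', 'p', 'table']) as $entry) {
    echo $entry['tag'] . ': ' . $entry['text'] . "\n";
}
// Prints:
// h1: Title
// p: Intro
// h2: Part
// p: More
```

Keep in mind that textContent includes the text of all descendants, so a p nested inside a table would show up twice: once on its own and once as part of the table's text.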


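You can do so with a regex, although this pattern only handles simple markup (an h1 with no attributes and no nested tags).

preg_match_all('#<h1>([^<]*)</h1>#Usi', $html_string, $matches);
// $matches[0] holds the full matches, $matches[1] the captured text
foreach ($matches[1] as $match)
{
  // do something with $match
}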
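
To make the shape of the result concrete: with the default PREG_PATTERN_ORDER, preg_match_all puts the full matches in $matches[0] and the first capture group in $matches[1]. The small sketch below runs the pattern against a made-up input string (the sample HTML is an assumption for illustration) and shows that headings with attributes or nested tags are silently skipped.

```php
<?php
// Made-up input: one plain h1, one with an attribute, one with a nested tag.
$html_string = '<h1>First</h1><h1 class="big">Second</h1><h1>Third <em>one</em></h1>';

preg_match_all('#<h1>([^<]*)</h1>#Usi', $html_string, $matches);

// $matches[1] contains only the captured heading texts.
foreach ($matches[1] as $text) {
    echo $text . "\n";
}
// Prints only:
// First
```

The second and third headings are not matched, which is one reason a DOM parser tends to be the safer choice for real pages.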


I am not sure what your source is, so I added a call that fetches the content from a URL.

$file = file_get_contents($url);

$doc = new DOMDocument();
$doc->loadHTML($file);

// getElementsByTagName returns a DOMNodeList, so take the body element itself with item(0)
$body = $doc->getElementsByTagName('body')->item(0);
$items = $body->getElementsByTagName('h1');

I am not sure of this part, but you should be able to loop over the list by index:

for ($i = 0; $i < $items->length; $i++) {
    echo $items->item($i)->nodeValue . "\n";
}

Or:

foreach ($items as $item) {
    echo $item->nodeValue . "\n";
}

Here is more info on nodeValue: http://docs.php.net/manual/en/function.domnode-node-value.php

Hope it helps!
