Parse HTML with PHP to get sibling elements grouped by class

开发者 (Developer) https://www.devze.com, 2023-04-01 07:18, source: web

I have a HUGE HTML document that I need to parse. The document is a list of <p> elements, all direct children of the body tag. They differ only in class name. The structure is like this:

    <p class="first-level"></p>
    <p class="second-level"></p>
    <p class="third-level"></p>
    <p class="third-level"></p>
    <p class="nth-levels just-for-demo-1"></p>
    <p class="nth-levels just-for-demo-1"></p>
    <p class="third-level"></p>
    <p class="second-level"></p>
    <p class="third-level"></p>
    <p class="nth-levels just-for-demo-2"></p>
    <p class="first-level"></p>
    <p class="second-level"></p>
    <p class="second-level"></p>
    <p class="third-level"></p>

And so on. nth-level can be any class name that isn't first-level, second-level or third-level. Basically it's a multi-level <ul> element, very poorly marked up.

What I want to do is parse it and obtain all <p> elements (including the tag, not just innerHTML) that sit between elements carrying one of the class names above.

In the example above, I want to get:

    <p class="nth-levels just-for-demo-1"></p>
    <p class="nth-levels just-for-demo-1"></p>

and

    <p class="nth-levels just-for-demo-2"></p>

How the heck can I do that please? Thank you.

Using XPath:

    //p[not(@class='first-level')][not(@class='second-level')][not(@class='third-level')]

to get the (non-)matching nodes, then you can use this answer to get the outerHTML of the nodes.
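To make that concrete, here is a minimal sketch of running the query with DOMXPath and serializing each matched node with DOMDocument::saveHTML($node), which returns the node including its own tag. The sample markup and the variable names ($html, $outer) are assumptions for illustration; the real document would be loaded from wherever it lives.

```php
<?php
// Sample markup based on the question (assumption: the real input comes from a file).
$html = <<<'HTML'
<html><body>
<p class="first-level"></p>
<p class="second-level"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="nth-levels just-for-demo-1"></p>
<p class="third-level"></p>
<p class="nth-levels just-for-demo-2"></p>
</body></html>
HTML;

$doc = new DOMDocument();
libxml_use_internal_errors(true); // keep parser warnings about sloppy markup quiet
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Same query as above: every <p> whose class attribute is not exactly one of the three levels.
$query = "//p[not(@class='first-level')][not(@class='second-level')][not(@class='third-level')]";

$outer = [];
foreach ($xpath->query($query) as $node) {
    // saveHTML() with a node argument serializes that node, tag included.
    $outer[] = $doc->saveHTML($node);
}

print_r($outer);
```

Note that the comparison is an exact string match on the class attribute, so a paragraph with class="first-level extra" would also be returned by this query.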

Additionally, if you're familiar with jQuery, try a jQuery port for PHP. It gives you a powerful set of tools for matching elements in a document (selectors), just as you are used to in jQuery, along with hierarchy, attribute filters, child filters, and so on.

    $doc = new DOMDocument;
    $doc->loadHTML(...); // pass the HTML string here
    $query = '//p[contains(@class, "just-for-demo-")]';
    $xpath = new DOMXPath($doc);
    $entries = $xpath->query($query);

    foreach ($entries as $entry)
    {
      // not the best solution yet: rebuilds the tag by hand
      $attribute = '';
      foreach ($entry->attributes as $attr)
      {
        // the leading space separates the attribute from the tag name
        $attribute .= " {$attr->name}=\"{$attr->value}\"";
      }

      // nodeValue is the text content only, so nested markup is lost
      echo "<{$entry->nodeName}{$attribute}>{$entry->nodeValue}</{$entry->nodeName}>";
    }
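The loop above handles each match on its own, while the question asks for the matches grouped into runs of adjacent siblings. One possible way to get that, sketched below, is to walk the children of the body in document order and start a new group whenever a level paragraph is seen. The function name groupNonLevelRuns and its parameters are hypothetical, and the sketch assumes every <p> is a direct child of the body, as the question states.

```php
<?php
/**
 * Collect runs of adjacent <p> siblings whose class is not one of the
 * known level names. Name and signature are hypothetical.
 *
 * @return array<int, string[]> one array of outer-HTML strings per run
 */
function groupNonLevelRuns(DOMDocument $doc, array $levels): array
{
    $groups = [];
    $current = [];
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return [];
    }
    foreach ($body->childNodes as $child) {
        if (!($child instanceof DOMElement) || $child->nodeName !== 'p') {
            continue; // skip whitespace text nodes and non-<p> elements
        }
        if (in_array($child->getAttribute('class'), $levels, true)) {
            // A level paragraph closes the run in progress, if there is one.
            if ($current) {
                $groups[] = $current;
                $current = [];
            }
        } else {
            $current[] = $doc->saveHTML($child); // outer HTML of the paragraph
        }
    }
    if ($current) {
        $groups[] = $current; // run that reaches the end of the body
    }
    return $groups;
}

// Usage with a shortened version of the question's markup.
$html = '<html><body>'
    . '<p class="first-level"></p>'
    . '<p class="nth-levels just-for-demo-1"></p>'
    . '<p class="nth-levels just-for-demo-1"></p>'
    . '<p class="third-level"></p>'
    . '<p class="second-level"></p>'
    . '<p class="nth-levels just-for-demo-2"></p>'
    . '<p class="first-level"></p>'
    . '</body></html>';
$doc = new DOMDocument();
$doc->loadHTML($html);
$groups = groupNonLevelRuns($doc, ['first-level', 'second-level', 'third-level']);
print_r($groups);
```

Walking siblings in order keeps the grouping logic in plain PHP instead of a more involved XPath expression, which may be easier to adjust if the level names change.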

You could open the file (with fopen or something similar) and read it one line at a time. Then check whether the required string is in the line (for example with strstr) and, if so, add the line to an array or do whatever you need with it. Note: this only works if each paragraph is on its own line.

fopen documentation

strstr documentation
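The line-by-line idea could look roughly like this. The function name linesContaining, its parameters, and the temporary file are assumptions made up for the example, and the sketch relies on the same caveat as above: one <p> per line.

```php
<?php
/**
 * Read a file one line at a time and keep the lines that contain $needle.
 * Name and parameters are hypothetical; assumes one <p> element per line.
 */
function linesContaining(string $path, string $needle): array
{
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $matches = [];
    // fgets() returns false at end of file, so only one line is held in memory at a time.
    while (($line = fgets($handle)) !== false) {
        if (strstr($line, $needle) !== false) {
            $matches[] = rtrim($line, "\r\n"); // drop the trailing newline
        }
    }
    fclose($handle);
    return $matches;
}

// Usage with a temporary file standing in for the real document.
$path = tempnam(sys_get_temp_dir(), 'html');
file_put_contents(
    $path,
    "<p class=\"first-level\"></p>\n"
    . "<p class=\"nth-levels just-for-demo-1\"></p>\n"
    . "<p class=\"third-level\"></p>\n"
);
$found = linesContaining($path, 'just-for-demo-');
unlink($path);
print_r($found);
```

Because it never builds a DOM, this approach stays cheap on a huge file, but it cannot tell which matching lines are adjacent unless you also track the previous line.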