
Pulling out a full node with child nodes using XPath

Developer https://www.devze.com 2022-12-08 17:54 Source: Internet

I'm using XPath to select a section from an HTML page. However, when I use XPath to extract the node, it selects only the text surrounding the HTML tags and not the HTML tags themselves.

Sample HTML

<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>

I have the following XPath

/body/div

I get the following

At first glance you may ask, &#8220;what do you mean?&#8221; It means that we want to help figure...

I want

At first glance you may ask, &#8220;what <i>exactly</i> do you mean?&#8221; It means that we want to help <b>you</b> figure...

Notice that the Sample HTML has <i> and <b> tags in the content. The words within those tags are "lost" when I extract the content.

I'm using SimpleXML in PHP if that makes a difference.


Your XPath is fine, though you can remove the final /. since it is redundant:

/atom/content

All of the HTML is inside a <![CDATA[ ]]> section, so in the XML DOM you actually have only text there. The <i> and <b> tags are not parsed as elements; they just show up as text. Using a CDATA section is exactly the same as if your XML were written like this:

<atom>
    <content>
      At first glance you may ask, &amp;#8220;what &lt;i&gt;exactly&lt;/i&gt;
      do you mean?&amp;#8221; It means that we want to help &lt;b&gt;you&lt;/b&gt; figure...
    </content>
</atom>

So, it's whatever you're doing with the <content> element afterwards that's dropping those tags. Are you later parsing the text as HTML, or running it through a filter, or something like that?
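To see the equivalence described above, here is a minimal sketch in PHP with SimpleXML. The two sample strings are shortened, made-up versions of the feed (only the <atom> and <content> names come from the answer), so treat it as an illustration and not as the asker's actual data.

```php
<?php
// Hypothetical, shortened inputs: one uses CDATA, the other escaped markup.
$cdata   = '<atom><content><![CDATA[what <i>exactly</i> do you mean?]]></content></atom>';
$escaped = '<atom><content>what &lt;i&gt;exactly&lt;/i&gt; do you mean?</content></atom>';

$a = simplexml_load_string($cdata);
$b = simplexml_load_string($escaped);

// Both forms give the same string, with the tags present as literal text.
$textA = (string) $a->content;
$textB = (string) $b->content;
var_dump($textA === $textB);

// Neither document has an <i> child element under <content>.
$childCount = count($a->content->children());
var_dump($childCount);
```

If the tags go missing after this point, the cause is likely in whatever later processes that string, which is what the question above is asking.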


SimpleXML doesn't handle text nodes mixed with child elements well, so you'll have to use a custom solution instead.

You can use asXML() on each div element then remove the div tags, or you can convert the div elements to DOMNodes then loop over $div->childNodes and serialize each child. Note that your HTML entities will most likely be replaced by the actual characters if available.
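The DOM route could look roughly like the sketch below. The helper name innerXml() is made up for illustration and is not part of SimpleXML. The sample markup is a shortened, ASCII-only version of the question's HTML, so character entities don't come into play here.

```php
<?php
// Hypothetical helper: serialize every child node of a SimpleXML element.
function innerXml(SimpleXMLElement $element): string
{
    // Convert the SimpleXML element to its DOM counterpart.
    $node = dom_import_simplexml($element);
    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes a single node, tags included.
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

// Shortened sample based on the question's markup.
$html = '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>';
$body = simplexml_load_string($html);

$results = [];
foreach ($body->xpath('/body/div') as $div) {
    $results[] = innerXml($div);
}

echo $results[0], PHP_EOL;
```

Because each child is serialized individually, text nodes and inline elements come out in document order and the surrounding div tags are left out.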

Alternatively, you can take a look at the SimpleDOM project and use its innerHTML() method.

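// Assumes the SimpleDOM library is already loaded (for example via include);
// simpledom_load_string() and innerHTML() come from SimpleDOM, not from core PHP.
$html = 
'<body>
    <div>
      At first glance you may ask, &#8220;what <i>exactly</i>
      do you mean?&#8221; It means that we want to help <b>you</b> figure...
    </div>
</body>';

$body = simpledom_load_string($html);

// Dump the inner markup of each matching div, inline tags included.
foreach ($body->xpath('/body/div') as $div)
{
    var_dump($div->innerHTML());
}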


I don't know whether SimpleXML behaves differently, but it seems you need to make sure you're selecting all node types and not just text. In standard XPath you would use /body/div/node().
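Here is a small sketch of the node() approach. It uses PHP's DOMDocument and DOMXPath in place of SimpleXML, on the assumption that switching to the DOM extension is acceptable, and the sample markup is a shortened version of the question's HTML.

```php
<?php
// Shortened sample based on the question's markup.
$html = '<body><div>what <i>exactly</i> do you mean?</div></body>';

$doc = new DOMDocument();
$doc->loadXML($html);
$xpath = new DOMXPath($doc);

// node() matches every child of the div: text nodes and elements alike.
$parts = [];
foreach ($xpath->query('/body/div/node()') as $node) {
    $parts[] = $doc->saveXML($node);
}

echo implode('', $parts), PHP_EOL;
```

Each matched node is serialized on its own, so the inline <i> element keeps its tags when the pieces are joined.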
