I'm using XPath to select a section from an HTML page. However, when I use XPath to extract the node, it selects only the text surrounding the HTML tags and not the HTML tags themselves.
Sample HTML
<body>
<div>
At first glance you may ask, “what <i>exactly</i>
do you mean?” It means that we want to help <b>you</b> figure...
</div>
</body>
I have the following XPath
/body/div
I get the following
At first glance you may ask, “what do you mean?” It means that we want to help figure...
I want
At first glance you may ask, “what <i>exactly</i> do you mean?” It means that we want to help <b>you</b> figure...
If you notice, in the sample HTML there are <i/> and <b /> HTML tags in the content. The words within those tags are "lost" when I extract the content.
I'm using SimpleXML in PHP if that makes a difference.
Your XPath is fine, though you can remove the final /. as that's redundant:
/atom/content
All of the HTML is inside of a <![CDATA[ ]]> section, so in the XML DOM you actually only have text there. The <i> and <b> tags will not be parsed as tags but will just show up as text. Using a CDATA section is exactly the same as if your XML were written like this:
<atom>
<content>
At first glance you may ask, &#8220;what &lt;i&gt;exactly&lt;/i&gt;
do you mean?&#8221; It means that we want to help &lt;b&gt;you&lt;/b&gt; figure...
</content>
</atom>
So, it's whatever you're doing with the <content> element afterwards that's dropping those tags. Are you later parsing the text as HTML, or running it through a filter, or something like that?
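To illustrate the point about CDATA, here is a minimal sketch. The XML string is a hypothetical stand-in shaped like the document described above (I haven't seen the real feed), and the variable names are mine. It only shows that the markup inside the CDATA section is text, not child elements, and that it is still present in the element's string value.

```php
<?php
// Hypothetical document shaped like the one this answer describes.
$xml = '<atom><content><![CDATA[what <i>exactly</i> do you mean?]]></content></atom>';

$atom = simplexml_load_string($xml);
$content = $atom->xpath('/atom/content')[0];

// The CDATA section holds plain text, so <i> is not a child element...
$childCount = count($content->children());

// ...but the markup is still there in the element's string value.
$text = (string) $content;

var_dump($childCount);
echo $text, "\n";
```

If the tags are present at this point, whatever strips them is happening in a later step.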
SimpleXML doesn't deal well with text nodes mixed in with elements, so you'll have to use a custom solution instead. You can use asXML() on each div element and then remove the div tags, or you can convert the div elements to DOMNodes (with dom_import_simplexml()), then loop over $div->childNodes and serialize each child. Note that your HTML entities will most likely be replaced by the actual characters if available.
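The DOM-based option could be sketched roughly like this. It assumes the markup is well-formed XML, as in the question's sample. innerXml() is a hypothetical helper name of my own, not something SimpleXML provides.

```php
<?php
// Sample markup from the question; it must be well-formed to load as XML.
$html = '<body>
<div>
At first glance you may ask, “what <i>exactly</i>
do you mean?” It means that we want to help <b>you</b> figure...
</div>
</body>';

// innerXml() is a hypothetical helper, not part of SimpleXML.
function innerXml(SimpleXMLElement $element): string
{
    // Get the DOMElement that backs the SimpleXML element.
    $node = dom_import_simplexml($element);

    $out = '';
    foreach ($node->childNodes as $child) {
        // saveXML($child) serializes a single child node, tags included.
        $out .= $node->ownerDocument->saveXML($child);
    }
    return $out;
}

$body = simplexml_load_string($html);

foreach ($body->xpath('/body/div') as $div) {
    echo innerXml($div), "\n";
}
```

The serialized string keeps the <i> and <b> elements. Depending on the document's encoding, non-ASCII characters such as the curly quotes may come back as numeric character references.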
Alternatively, you can take a look at the SimpleDOM project and use its innerHTML() method.
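// Assumes the SimpleDOM library has already been loaded, e.g. include 'SimpleDOM.php';
$html = '<body>
<div>
At first glance you may ask, “what <i>exactly</i>
do you mean?” It means that we want to help <b>you</b> figure...
</div>
</body>';

$body = simpledom_load_string($html);

// innerHTML() returns the content of each div, child tags included
foreach ($body->xpath('/body/div') as $div) {
    var_dump($div->innerHTML());
}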
I don't know if SimpleXML is different, but to me it seems you need to make sure you're selecting all node types and not just text. In standard XPath you would use /body/div/node() for that.
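Here is a rough sketch of the node() idea in PHP. It uses DOMXPath rather than SimpleXML, which is a substitution on my part: as far as I know, SimpleXML's xpath() only hands back SimpleXMLElement objects, so text nodes don't come back as separate items. The shortened sample string and the variable names are assumptions for illustration.

```php
<?php
// Shortened, ASCII-only version of the question's sample (an assumption for brevity).
$html = '<body><div>what <i>exactly</i> do you mean? We want to help <b>you</b> figure...</div></body>';

$doc = new DOMDocument();
$doc->loadXML($html); // assumes the markup is well-formed XML

$xpath = new DOMXPath($doc);

$parts = [];
// node() matches every child of the div: text nodes and elements alike.
foreach ($xpath->query('/body/div/node()') as $node) {
    $parts[] = $doc->saveXML($node);
}

echo implode('', $parts), "\n";
```

Joining the serialized nodes gives back the div's content with the inline tags still in place.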