PHP DOMDocument - get html source of BODY

Developer https://www.devze.com 2022-12-21 04:52 Source: Internet
I'm using PHP's DOMDocument to parse and normalize user-submitted HTML using the loadHTML method to parse the content then getting a well-formed result via saveHTML:

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML(); 
echo($well_formed);

This does a beautiful job of parsing the fragment and adding the appropriate closing tags. The problem is that I'm also getting a bunch of tags I don't want, such as <!DOCTYPE>, <html>, <head> and <body>. I understand that every well-formed HTML document needs these tags, but the HTML fragment I'm normalizing is going to be inserted into an existing valid document.


The quick solution to your problem is to use an XPath expression to grab the body.

$dom= new DOMDocument();
$dom->loadHTML('<div><p>Hello World');      
$xpath = new DOMXPath($dom);
$body = $xpath->query('/html/body');
echo($dom->saveXml($body->item(0)));

A word of warning here. Sometimes loadHTML will throw a warning when it encounters certain poorly formed HTML documents. If you're parsing those kinds of HTML documents, you'll need to find a better HTML parser [self link warning].
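If the warnings are the only problem, one option is to have libxml buffer its parse errors instead of emitting PHP warnings. This is a minimal sketch using the built-in libxml_use_internal_errors(), libxml_get_errors() and libxml_clear_errors() functions; logging through error_log() is just my assumption about what you would want to do with the messages.

```php
<?php
// Sketch: collect libxml parse errors instead of letting loadHTML() emit warnings.
$html = '<div><p>Hello World';

// Switch to internal error buffering and remember the previous setting.
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
$loaded = $dom->loadHTML($html);

// Fetch whatever the parser complained about (may be an empty array).
$errors = libxml_get_errors();
libxml_clear_errors();

// Restore the previous error-handling mode for the rest of the script.
libxml_use_internal_errors($previous);

foreach ($errors as $error) {
    // Each entry is a LibXMLError object with message, line, level, etc.
    error_log(sprintf('libxml: %s (line %d)', trim($error->message), $error->line));
}
```

This only silences and records the warnings. It does not change how the parser repairs the markup.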


In your case, you do not want to work with an HTML document, but with an HTML fragment, a portion of HTML code, which means DOMDocument is not quite what you need.

Instead, I would rather use something like HTMLPurifier (quoting):

HTML Purifier is a standards-compliant HTML filter library written in PHP. HTML Purifier will not only remove all malicious code (better known as XSS) with a thoroughly audited, secure yet permissive whitelist, it will also make sure your documents are standards compliant, something only achievable with a comprehensive knowledge of W3C's specifications.

And if you try your portion of code:

<div><p>Hello World

using the demo page of HTMLPurifier, you get this clean HTML as output:

<div><p>Hello World</p></div>

Much better, isn't it? ;-)

(Note that HTMLPurifier supports a wide range of options, and that taking a look at its documentation might not hurt.)


Faced with the same problem, I've created a wrapper around DOMDocument called SmartDOMDocument to overcome this and some other shortcomings (such as encoding problems).

You can find it here: http://beerpla.net/projects/smartdomdocument


This was taken from another post and worked perfectly for my use:

$layout = preg_replace('~<(?:!DOCTYPE|/?(?:html|head|body))[^>]*>\s*~i', '', $layout);
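One caveat worth checking before relying on this: with no word boundary after the tag name, a pattern of this shape also matches tags that merely start with "head", such as <header>. Below is a sketch of a variant with \b added. The \b is my own assumption, not part of the quoted post, and $pattern is just an illustrative variable name.

```php
<?php
// Hypothetical variant: \b stops "head" from matching inside "header".
$pattern = '~<(?:!DOCTYPE|/?(?:html|head|body)\b)[^>]*>\s*~i';

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');

// Strip the doctype and the html/head/body wrapper tags from the serialized document.
$layout = preg_replace($pattern, '', $dom->saveHTML());
echo $layout, "\n";

// A <header> element passes through this variant untouched.
$untouched = preg_replace($pattern, '', '<header>x</header>');
echo $untouched, "\n";
```

Either way, a regex like this also strips those tags when they appear inside the user-submitted content itself, so it is a blunt tool.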


TL;DR: $dom->saveHTML($dom->documentElement->lastChild);
Here $dom->documentElement->lastChild is the body node, but it could be any other DOMNode in the document.


Actually, the DOMDocument::saveHTML method itself is capable of doing what you want. It takes a DOMNode object as its first argument to output a subset of the document.

$dom = new DOMDocument();
$dom->loadHTML('<div><p>Hello World');
$well_formed= $dom->saveHTML($dom->documentElement->lastChild); 
echo($well_formed);

There are several ways of retrieving the body-node. Here are 2:

$bodyNode = $dom->documentElement->lastChild;
$bodyNode = $dom->getElementsByTagName('body')->item(0);

From the PHP Manual:

public DOMDocument::saveHTML(?DOMNode $node = null): string|false

Parameters

node
    Optional parameter to output a subset of the document.

https://www.php.net/manual/en/domdocument.savehtml.php
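Note that passing the body node still prints the <body> tags themselves. If you want only what is inside the body, one way is to serialize each child of the body node and concatenate the results. This is a rough sketch of that idea. normalizeFragment() is a hypothetical helper name of mine, not something DOMDocument provides.

```php
<?php
// Hypothetical helper: returns the well-formed markup inside <body>, without the wrapper tags.
function normalizeFragment(string $html): string
{
    // loadHTML() rejects an empty string, so short-circuit that case.
    if (trim($html) === '') {
        return '';
    }

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // Serialize every child of <body> on its own and join the pieces.
    $out = '';
    foreach ($body->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo normalizeFragment('<div><p>Hello World'), "\n";
```

The result can then be inserted into the existing document as-is, since no doctype, html, head or body tags are emitted.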
