Fix HTML fragment

Developer https://www.devze.com 2023-02-01 23:31 Source: Internet
I'm trying to learn how to use PHP's DOM functions. As an exercise, I want to repair an invalid HTML fragment. So far, I've been able to produce a full document:

<?php

$fragment = '<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class=foo>luptate</strong></em>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.';

$doc = new DOMDocument;
libxml_use_internal_errors(TRUE);
$doc->loadHTML($fragment);
libxml_use_internal_errors(FALSE);
$doc->formatOutput = TRUE;
echo $doc->saveHTML();

?>

... which prints:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class="foo">luptate</em></strong>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.</div>
</div></body></html>

My questions:

  1. Is there a way to print only the HTML that corresponds to the original fragment?
  2. Is there a more appropriate built-in library for such task?


This should work, although it is a bit ugly:

$doc->loadHTML($fragment);
echo simplexml_import_dom( $doc->getElementsByTagName('div')->item(0) )->asXML();

Output:

<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
  <strong><em class="foo">luptate</em></strong>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.</div></div>

Slightly more elegant:

$xpath   = new DOMXPath($doc);
$query   = '/html/body/*';        // loadHTML() always wraps the fragment in <html><body>...
$entries = $xpath->query($query);
foreach ($entries as $entry)
{
  echo simplexml_import_dom($entry)->asXML();
}


It seems that recent PHP versions (5.3.6 and later) finally implement this, by letting saveHTML() accept a node:

How to return outer html of DOMDocument?

That way we can do this:

if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->documentElement->firstChild;
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}

... or this:

if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->getElementsByTagName('body')->item(0);
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}

Too bad we still need an ugly workaround for older versions.
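
For what it's worth, the two variants above can be folded into one helper. This is only a sketch: the name repair_fragment() is made up for illustration, and the saveXML() fallback for PHP older than 5.3.6 is an assumption of mine, not something from this thread (it emits XML-style markup rather than HTML).

```php
<?php
// Hypothetical helper (name invented for this sketch): repair an HTML
// fragment and return only the markup that corresponds to it.
function repair_fragment($fragment)
{
    $doc = new DOMDocument;
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the caller's setting

    // loadHTML() wraps the fragment in <html><body>, so start from <body>.
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // saveHTML($node) needs PHP >= 5.3.6; saveXML($node) is the assumed fallback.
    $canSaveNode = version_compare(PHP_VERSION, '5.3.6', '>=');
    $html = '';
    foreach ($body->childNodes as $node) {
        $html .= $canSaveNode ? $doc->saveHTML($node) : $doc->saveXML($node);
    }
    return $html;
}
```

Calling echo repair_fragment($fragment); with the fragment from the question should then print the repaired markup without the DOCTYPE and the <html><body> wrapper. Because it walks childNodes rather than an XPath of '/html/body/*', text nodes sitting directly under <body> are kept as well.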


You could strip the parts you don't want, since they always appear in the output, for example:

$result = $doc->saveHTML();
// saveHTML() puts a line break after the DOCTYPE, so remove each wrapper piece separately.
$result = str_replace(array('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">', '<html><body>', '</body></html>'), '', $result);
$result = trim($result);
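
An alternative to stripping the wrapper afterwards is to ask libxml not to add it in the first place. A minimal sketch, assuming PHP 5.4+ built against libxml 2.7.8 or later (where the LIBXML_HTML_NOIMPLIED and LIBXML_HTML_NODEFDTD constants exist) and a fragment with a single root element, as in the question. The function name load_without_wrapper() is hypothetical.

```php
<?php
// Hypothetical helper: parse a fragment without the implied wrapper elements.
function load_without_wrapper($fragment)
{
    $doc = new DOMDocument;
    $previous = libxml_use_internal_errors(true);
    // NOIMPLIED: do not add <html><body>; NODEFDTD: do not add a default DOCTYPE.
    $doc->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // trim() drops surrounding whitespace such as the trailing line break.
    return trim($doc->saveHTML());
}
```

With several top-level nodes these flags reportedly nest the siblings in odd ways, so in that case it is probably safer to wrap the input in a container element first.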

You could always try this class:

http://www.barattalo.it/html-fixer/

Using it is apparently as easy as this:

$dirty_html = ".....bad html here......";

$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);

It all depends on what you will be doing with the information.


Well, PHP >= 5.1 apparently also has a DOMDocumentFragment class, which has an appendXML() method: http://php.net/manual/en/domdocumentfragment.appendxml.php. Maybe you can use that? I'm not sure whether it has a string representation of itself, but who knows.

EDIT:

Well, that doesn't work :)

What you could do, though, is use SimpleXML, either directly or by creating a DOMElement and then using simplexml_import_dom($domelement)->asXML(): http://php.net/manual/en/function.simplexml-import-dom.php. Good luck! :)
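
To illustrate the appendXML() attempt from the edit above: as far as I know it parses its argument as XML, so mis-nested markup is rejected rather than repaired. A small sketch (try_append_xml() is a made-up name) which also shows that DOMDocument::saveXML() can serialize the fragment when the input is well-formed:

```php
<?php
// Hypothetical helper: returns the serialized fragment, or false when
// appendXML() refuses the markup.
function try_append_xml($markup)
{
    $doc = new DOMDocument;
    $fragment = $doc->createDocumentFragment();

    $previous = libxml_use_internal_errors(true); // keep XML parse errors quiet
    $ok = $fragment->appendXML($markup);          // false if $markup is not well-formed XML
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $ok ? $doc->saveXML($fragment) : false;
}

var_dump(try_append_xml('<strong><em>luptate</strong></em>')); // mis-nested tags
var_dump(try_append_xml('<p>a</p><p>b</p>'));                  // well-formed
```

So the fragment does have a string representation, but only for input that is already valid, which is not much help for repairing broken HTML.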
