PHP: Display the first 500 characters of HTML

I have a large chunk of HTML in a PHP variable, like this:

$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';

I want to display only the first 500 characters of this code. The character count must include the text inside the HTML tags but exclude the tags and attributes themselves. Trimming the code must not break the DOM structure of the HTML.

Are there any tutorials or working examples available?

If it's only the text you want, you can also do this:

substr(strip_tags($html_code),0,500);
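A slightly more careful sketch of the same text-only idea follows. It assumes UTF-8 input and the mbstring extension, and `plain_text_excerpt` is a hypothetical helper name, not something from the question. `strip_tags()` leaves entities such as `&amp;` encoded and `substr()` counts bytes, so the sketch decodes entities and counts characters instead.

```php
<?php
// Hypothetical helper: returns a text-only excerpt; all markup is discarded.
function plain_text_excerpt(string $html, int $limit = 500): string
{
    $text = strip_tags($html);
    // Decode entities so "&amp;" counts as one character, not five.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // mb_substr() counts characters rather than bytes.
    return mb_substr($text, 0, $limit, 'UTF-8');
}

$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
echo plain_text_excerpt($html_code, 10); // The Samepl
```

Because the result is decoded plain text, it should be passed through `htmlspecialchars()` before being echoed into a page.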

Ooohh... I know this one. I can't get it exactly right off the top of my head, but you want to load the text you've got into a DOMDocument:

http://www.php.net/manual/en/class.domdocument.php

then grab the text from the entire document node (as a DOMNode: http://www.php.net/manual/en/class.domnode.php)

This won't be exactly right, but hopefully this will steer you onto the right track. Try something like:

 $html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
 $dom = new DOMDocument();
 $dom->loadHTML($html_code);
 $text_to_strip = $dom->textContent;
 $stripped = mb_substr($text_to_strip,0,500);
 echo "$stripped";  // The Sameple text.Another sample text.....

Edit: OK, that should work. I just tested it locally.

Edit 2:

Now that I understand you want to keep the tags but limit the text, let's see. You're going to want to loop over the content until you reach 500 characters. This will probably take me a few edits and passes to get right, but hopefully I can help. (Sorry I can't give this my undivided attention.)

First case is when the text is less than 500 characters. Nothing to worry about. Starting with the above code we can do the following.

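  // compare the full text length; $stripped is already capped at 500
  if (mb_strlen($text_to_strip) > 500) {
       // this is where we do our work.
       // loadHTML() wraps the fragment in <html><body>, so walk the body's children
       $body = $dom->getElementsByTagName('body')->item(0);
       $characters_so_far = 0;
       // copy the live node list first so that removing nodes doesn't skip siblings
       foreach (iterator_to_array($body->childNodes) as $ChildNode) {

          // should check if $ChildNode->hasChildNodes();
          // probably put some of this stuff into a function
          $characters_in_next_node = mb_strlen($ChildNode->textContent);
          if ($characters_so_far + $characters_in_next_node > 500) {
              // remove the node that would take us past the limit
              $ChildNode->parentNode->removeChild($ChildNode);
          }
          $characters_so_far += $characters_in_next_node;
       }
       // note: saveHTML() returns the whole document, including the added doctype/html/body wrapper
       $final_out = $dom->saveHTML();
  } else {
        $final_out = $html_code;
  }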
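The loop above only drops whole top-level nodes. A minimal sketch of the recursive version hinted at in its comments might look like the following. The names `truncate_node` and `truncate_html` are hypothetical, the input is assumed to be a UTF-8 fragment rather than a full page, and whitespace between tags counts toward the limit. It walks the tree depth-first, spends the character budget on text nodes, cuts the text node where the budget runs out, and removes everything after it, so the surviving tags stay properly nested.

```php
<?php
// Hypothetical helper: spends $remaining characters on the text under $node
// and deletes whatever comes after the budget is used up.
function truncate_node(DOMNode $node, int &$remaining): void
{
    // Copy the live child list so removals do not disturb the iteration.
    foreach (iterator_to_array($node->childNodes) as $child) {
        if ($remaining <= 0) {
            $node->removeChild($child);            // past the limit: drop it
        } elseif ($child instanceof DOMText) {
            $length = mb_strlen($child->nodeValue, 'UTF-8');
            if ($length > $remaining) {
                // Cut this text node where the budget ends.
                $child->nodeValue = mb_substr($child->nodeValue, 0, $remaining, 'UTF-8');
                $remaining = 0;
            } else {
                $remaining -= $length;
            }
        } elseif ($child->hasChildNodes()) {
            truncate_node($child, $remaining);     // element with content: recurse
        }
    }
}

// Hypothetical wrapper: truncates an HTML fragment to $limit text characters.
function truncate_html(string $html, int $limit = 500): string
{
    $dom = new DOMDocument();
    // The XML declaration is a common trick to make libxml read the input as UTF-8;
    // the extra <div> gives the fragment a single root that is easy to find again.
    // @ silences warnings about imperfect markup.
    @$dom->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    $wrapper = $dom->getElementsByTagName('div')->item(0);
    truncate_node($wrapper, $limit);

    // Serialize only the wrapper's children, not the html/body shell.
    $out = '';
    foreach ($wrapper->childNodes as $child) {
        $out .= $dom->saveHTML($child);
    }
    return $out;
}

echo truncate_html('<div class="a">Hello <b>world</b> and more</div><p>tail</p>', 8);
```

With a limit of 8, the visible text is cut to "Hello wo", the `<b>` and `<div>` tags are still closed, and the trailing `<p>` is gone.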

I'm pasting below a PHP class I wrote a long time ago, but I know it works. It's not exactly what you're after, as it deals with words instead of a character count, but I figure it's pretty close and someone might find it useful.

  class HtmlWordManipulator
  {
    var $stack = array();

    function truncate($text, $num=50) 
    { 
      if (preg_match_all('/\s+/', $text, $junk) <= $num) return $text; 
      // the callback must point at the method on this object, not a global function
      $text = preg_replace_callback('/(<\/?[^>]+\s+[^>]*>)/', array($this, '_truncateProtect'), $text); 
      $words = 0; 
      $out = array();
      $text = str_replace('<',' <',str_replace('>','> ',$text));
      $toks = preg_split('/\s+/', $text);
      foreach ($toks as $tok) 
      { 
        if (preg_match_all('/<(\/?[^\x01>]+)([^>]*)>/',$tok,$matches,PREG_SET_ORDER))  
          foreach ($matches as $tag) $this->_recordTag($tag[1], $tag[2]);  
        $out[] = trim($tok);
        if (! preg_match('/^(<[^>]+>)+$/', $tok))
        {
          if (!strpos($tok,'=') && !strpos($tok,'<') && strlen(trim(strip_tags($tok))) > 0) 
          {
           ++$words; 
          }
          else
          {                 
            /*
            echo '<hr />';
            echo htmlentities('failed: '.$tok).'<br /)>'; 
            echo htmlentities('has equals: '.strpos($tok,'=')).'<br />';
            echo htmlentities('has greater than: '.strpos($tok,'<')).'<br />';
            echo htmlentities('strip tags: '.strip_tags($tok)).'<br />';
            echo str_word_count($text);
            */
          } 
        }
        if ($words > $num) break; 
      } 
      $truncate = $this->_truncateRestore(implode(' ', $out));   
      return $truncate; 
    }

    function restoreTags($text)
    {
      foreach ($this->stack as $tag) $text .= "</$tag>";
      return $text;
    } 

    private function _truncateProtect($match) 
    { 
      return preg_replace('/\s/', "\x01", $match[0]); 
    } 

    private function _truncateRestore($strings) 
    { 
      return preg_replace('/\x01/', ' ', $strings); 
    }

    private function _recordTag($tag, $args) 
    { 
      // XHTML 
      if (strlen($args) and $args[strlen($args) - 1] == '/') return; 
      else if ($tag[0] == '/') 
      { 
        $tag = substr($tag, 1); 
        for ($i=count($this->stack) -1; $i >= 0; $i--) { 
         if ($this->stack[$i] == $tag) { 
           array_splice($this->stack, $i, 1); 
           return; 
         } 
        } 
        return; 
      } 
      else if (in_array($tag, array('p', 'li', 'ul', 'ol', 'div', 'span', 'a'))) 
        $this->stack[] = $tag;  
      else return;
    } 
  }

truncate is what you want: you pass it the HTML and the number of words you want it trimmed down to. It ignores HTML while counting words, but then rewraps everything in HTML, even closing the tags left open by the truncation.

Please don't judge me on the complete lack of OOP principles. I was young and stupid.

Edit:

So it turns out the usage is more like this:

$content = $manipulator->restoreTags($manipulator->truncate($myHtml,$numOfWords));

Stupid design decision. It allowed me to inject HTML inside the unclosed tags, though.

I'm not up to coding a real solution, but if someone wants to, here's where I'd start (a rough sketch that only collects the text):

$html_code = '<div class="contianer" style="text-align:center;">The Sameple text.</div><br><span>Another sample text.</span>....';
$aggregate = '';

// DOMDocument stands in for a generic parser; loadHTML() tolerates tag soup.
$document = new DOMDocument();
$document->loadHTML($html_code);

foreach ($document->getElementsByTagName('*') as $element) {
  foreach ($element->childNodes as $child) {
    if ($child instanceof DOMText) {
      $aggregate .= $child->nodeValue; // This is the text, not HTML. It doesn't
                                       // include the children, only the text
                                       // directly in the tag.
    }
  }
}
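To turn that idea into a cut-off point, it helps to know how many characters precede each piece of text. The sketch below is one way to get that. The name `text_node_offsets` is hypothetical, and the input is assumed to be a UTF-8 fragment. It uses DOMXPath to list the text nodes in document order together with their running offsets, so the node containing character 500 is the last one whose offset is below 500.

```php
<?php
// Hypothetical helper: returns the text nodes of $html in document order,
// each paired with the number of characters that precede it.
function text_node_offsets(string $html): array
{
    $dom = new DOMDocument();
    // Same assumptions as usual: UTF-8 fragment, wrapped in a <div> root;
    // @ silences warnings about imperfect markup.
    @$dom->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    $xpath = new DOMXPath($dom);

    $offsets = [];
    $seen = 0;
    // "//text()" selects every text node, in document order.
    foreach ($xpath->query('//text()') as $textNode) {
        $offsets[] = ['offset' => $seen, 'text' => $textNode->nodeValue];
        $seen += mb_strlen($textNode->nodeValue, 'UTF-8');
    }
    return $offsets;
}

foreach (text_node_offsets('<div>ab</div><br><span>cde</span>..') as $entry) {
    echo $entry['offset'], ': ', $entry['text'], "\n";
}
```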