开发者

How can I convert a docx document to html using php?

开发者 https://www.devze.com 2023-03-19 10:49 出处:网络
I want to be able to upload an MS word document and export it a page in my site.开发者_StackOverflow

I want to be able to upload an MS word document and export it a page in my site.

开发者_StackOverflow

Is there any way to accomplish this?


//FUNCTION :: read a docx file and return the string
function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = DOMDocument::loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING);
            // Return data without XML formatting tags

            $contents = explode('\n',strip_tags($xml->saveXML()));
            $text = '';
            foreach($contents as $i=>$content) {
                $text .= $contents[$i];
            }
            return $text;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}

ZipArchive and DOMDocument are both inside PHP so you don't need to install/include/require additional libraries.


One may use PHPDocX.

It has support for practically all HTML CSS styles. Moreover you may use templates to add extra formatting to your HTML via the replaceTemplateVariableByHTML.

The HTML methods of PHPDocX also allow for the direct use of Word styles. You may use something like this:

$docx->embedHTML($myHTML, array('tableStyle' => 'MediumGrid3-accent5PHPDOCX'));

If you want that all your tables use the MediumGrid3-accent5 Word style. The embedHTML method as well as its version for templates (replaceTemplateVariableByHTML) preserve inheritance, meaning by that that you may use a predefined Word style and override with CSS any of its properties.

You may also extract selected parts of your HTML using 'JQuery type' selectors.


You can convert Word docx documents to html using Print2flash library. Here is an PHP excerpt from my client's site which converts a document to html:

include("const.php");
$p2fServ = new COM("Print2Flash4.Server2");
$p2fServ->DefaultProfile->DocumentType=HTML5;
$p2fServ->ConvertFile($wordfile,$htmlFile);

It converts a document which path is specified in $wordfile variable to a html page file specified by $htmlFile variable. All formatting, hyperlinks and charts are retained. You can get the required const.php file altogether with a fuller sample from Print2flash SDK.


this is a workaround based on David Lin's answer above removing "w:" in a docx's xml tags leave behing Html like tags

    function readDocx($filePath) {
    // Create new ZIP archive
    $zip = new ZipArchive;
    $dataFile = 'word/document.xml';
    // Open received archive file
    if (true === $zip->open($filePath)) {
        // If done, search for the data file in the archive
        if (($index = $zip->locateName($dataFile)) !== false) {
            // If found, read it to the string
            $data = $zip->getFromIndex($index);
            // Close archive file
            $zip->close();
            // Load XML from a string
            // Skip errors and warnings
            $xml = new DOMDocument("1.0", "utf-8");
            $xml->loadXML($data, LIBXML_NOENT | LIBXML_XINCLUDE | LIBXML_NOERROR | LIBXML_NOWARNING|LIBXML_PARSEHUGE);
            $xml->encoding = "utf-8";
            // Return data without XML formatting tags
            $output =  $xml->saveXML();
            $output = str_replace("w:","",$output);

            return $output;
        }
        $zip->close();
    }
    // In case of failure return empty string
    return "";
}


Ok Im in very late, but thought I'd post this to save you all some time. This is some php code I have put together not just to read the text from docx but the images too, currently it does not support floating images / text, but what I have done so far is a massive move forwards to whats already been posted on here - note you need to update https://example.co.uk to YOUR domain name.

<?php

class Docx_ws_imglnk {
    public $originalpath = '';
    public $extractedpath = '';
}

class Docx_ws_rel {
    public $Id = '';
    public $Target = '';
}

class Docx_ws_def {
    public $styleId = '';
    public $type = '';
    public $color = '000000';
}

class Docx_p_def {
    public $data = array();
    public $text = "";
}

class Docx_p_item {
    public $name = "";
    public $value = "";
    public $innerstyle = "";
    public $type = "text";
}

class Docx_reader {

    private $fileData = false;
    private $errors = array();
    public $rels = array();
    public $imglnks = array();
    public $styles = array();
    public $document = null;
    public $paragraphs = array();
    public $path = '';
    private $saveimgpath = 'docimages';

    public function __construct() {
    
    }

    private function load($file) {
        if (file_exists($file)) {
            $zip = new ZipArchive();
            $openedZip = $zip->open($file);
            if ($openedZip === true) {
            
                $this->path = $file;
            
                //read and save images
                for ( $i = 0; $i < $zip->numFiles; $i ++ ) {
                    $zip_element = $zip->statIndex( $i );
                    if ( preg_match( "([^\s]+(\.(?i)(jpg|jpeg|png|gif|bmp))$)", $zip_element['name'] ) ) {
                        $imglnk = new Docx_ws_imglnk;
                        $imglnk->originalpath = $zip_element['name'];
                        $imagename                   = explode( '/',   $zip_element['name'] );
                        $imagename                   = end( $imagename );
                        $imglnk->extractedpath = dirname( __FILE__ ) . '/' . $this->savepath . $imagename;
                
                        $putres = file_put_contents( $imglnk->extractedpath, $zip->getFromIndex( $i ));
                        $imglnk->extractedpath = str_replace('var/www/', 'https://example.co.uk/', $imglnk->extractedpath);
                        $imglnk->extractedpath = substr($imglnk->extractedpath, 1);
                    
                        array_push($this->imglnks, $imglnk);
                    }
                }
            
                //read relationships
                if (($styleIndex = $zip->locateName('word/_rels/document.xml.rels')) !== false) {
                    $stylesRels = $zip->getFromIndex($styleIndex);
                    $xml = simplexml_load_string($stylesRels);
                    $XMLTEXT = $xml->saveXML();
                    $doc = new DOMDocument();
                    $doc->loadXML($XMLTEXT);
                    foreach($doc->documentElement->childNodes as $childnode)
                    {
                        $nodename = $childnode->nodeName;
                   
                        if($childnode->hasAttributes())
                        {
                            $rel = new Docx_ws_rel;
                            for ($a = 0; $a < $childnode->attributes->count(); $a++)
                            {
                                $attrNode = $childnode->attributes->item($a);
                            
                                if (strcmp( $attrNode->nodeName, 'Id') == 0)
                                {
                                    $rel->Id = $attrNode->nodeValue;
                                }
                                if (strcmp( $attrNode->nodeName, 'Target') == 0)
                                {
                                    $rel->Target = $attrNode->nodeValue;
                                }
                            }
                            array_push($this->rels, $rel);
                        }
                    }
                }
            
                //attempt to load styles:
                if (($styleIndex = $zip->locateName('word/styles.xml')) !== false) {
                    $stylesXml = $zip->getFromIndex($styleIndex);
                    $xml = simplexml_load_string($stylesXml);
                    $XMLTEXT = $xml->saveXML();
                    $doc = new DOMDocument();
                    $doc->loadXML($XMLTEXT);
               
                    foreach($doc->documentElement->childNodes as $childnode)
                    {
                        $nodename = $childnode->nodeName;
                    
                        //get style
                        if (strcmp($nodename, "w:style") == 0)
                        {
                            $ws_def = new Docx_ws_def;
                            for ($a=0; $a < $childnode->attributes->count(); $a++ )
                            {
                                $item = $childnode->attributes->item($a);
                                //style id
                                if (strcmp($item->nodeName, "w:styleId") == 0)
                                {
                                    $ws_def->styleId = $item->nodeValue;
                                }
                            
                                //style type
                                if (strcmp($item->nodeName, "w:type") == 0)
                                {
                                    $ws_def->type = $item->nodeValue;
                                }
                            }
                        }
                        //push style to the array of styles
                        if (strcmp($ws_def->styleId, "") != 0 && strcmp($ws_def->type, "") != 0)
                        {
                            array_push($this->styles, $ws_def);
                        }
                    }
                }

                if (($index = $zip->locateName('word/document.xml')) !== false) {
                    $stylesDoc = $zip->getFromIndex($index);
                    $xml = simplexml_load_string($stylesDoc);
                    $XMLTEXT = $xml->saveXML();
                    $this->document = new DOMDocument();
                    $this->document->loadXML($XMLTEXT);
                }
                $zip->close();
            } else {
                switch($openedZip) {
                    case ZipArchive::ER_EXISTS:
                        $this->errors[] = 'File exists.';
                        break;
                    case ZipArchive::ER_INCONS:
                        $this->errors[] = 'Inconsistent zip file.';
                        break;
                    case ZipArchive::ER_MEMORY:
                        $this->errors[] = 'Malloc failure.';
                        break;
                    case ZipArchive::ER_NOENT:
                        $this->errors[] = 'No such file.';
                        break;
                    case ZipArchive::ER_NOZIP:
                        $this->errors[] = 'File is not a zip archive.';
                        break;
                    case ZipArchive::ER_OPEN:
                        $this->errors[] = 'Could not open file.';
                        break;
                    case ZipArchive::ER_READ:
                        $this->errors[] = 'Read error.';
                        break;
                    case ZipArchive::ER_SEEK:
                        $this->errors[] = 'Seek error.';
                        break;
                }
            }
        } else {
            $this->errors[] = 'File does not exist.';
        }
    }

    public function setFile($path) {
        $this->fileData = $this->load($path);
    }

    public function to_plain_text() {
        if ($this->fileData) {
            return strip_tags($this->fileData);
        } else {
            return false;
        }
    }

    public function processDocument() {
        $html = '';    
    
        foreach($this->document->documentElement->childNodes as $childnode)
        {
            $nodename = $childnode->nodeName;
        
            //get the body of the document
            if (strcmp($nodename, "w:body") == 0)
            {
                foreach($childnode->childNodes as $subchildnode)
                {
                    $pnodename = $subchildnode->nodeName;
                
                    //process every paragraph
                    if (strcmp($pnodename, "w:p") == 0)
                    {
                        $pdef = new Docx_p_def;
                    
                        foreach($subchildnode->childNodes as $pchildnode)
                        {
                            //process any inner children
                            if (strcmp($pchildnode, "w:pPr") == 0)
                            {
                                foreach($pchildnode->childNodes as $prchildnode)
                                {
                                    //process text alignment
                                    if (strcmp($prchildnode->nodeName, "w:pStyle") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'styleId';
                                        $pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
                                        array_push($pdef->data, $pitem);
                                    }
                                
                                    //process text alignment
                                    if (strcmp($prchildnode->nodeName, "w:jc") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'align';
                                        $pitem->value = $prchildnode->attributes->getNamedItem('val')->nodeValue;
                                    
                                        if (strcmp($pitem->value, "left") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                    
                                        if (strcmp($pitem->value, "center") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                    
                                        if (strcmp($pitem->value, "right") == 0)
                                        {
                                            $pitem->innerstyle .= "text-align:" . $pitem->value . ";";
                                        }
                                    
                                        if (strcmp($pitem->value, "both") == 0)
                                        {
                                            $pitem->innerstyle .= "word-spacing:" . 10 . "px;";
                                        }
                                    
                                        array_push($pdef->data, $pitem);
                                    }
                                
                                    //process drawing
                                    if (strcmp($prchildnode->nodeName, "w:drawing") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'drawing';
                                        $pitem->value = '';
                                        $pitem->type = 'graphic';
                                    
                                        $extents = $prchildnode->getElementsByTagName('extent')[0];
                                        $cx = $extents->attributes->getNamedItem('cx')->nodeValue;
                                        $cy = $extents->attributes->getNamedItem('cy')->nodeValue;
                                        $pcx = (int)$cx / 9525;
                                        $pcy = (int)$cy / 9525;
                                    
                                        $pitem->innerstyle .= "width:" . $pcx . "px;";
                                        $pitem->innerstyle .= "height:" . $pcy . "px;";
                                    
                                        $blip = $prchildnode->getElementsByTagName('blip')[0];
                                        $pitem->value = $blip->attributes->getNamedItem('embed')->nodeValue;
                                 
                                        array_push($pdef->data, $pitem);
                                    }
                                
                                    //process spacing
                                    if (strcmp($prchildnode->nodeName, "w:spacing") == 0)
                                    {
                                        $pitem = new Docx_p_item;
                                        $pitem->name = 'paragraphSpacing';
                                        $bval = $prchildnode->attributes->getNamedItem('before')->nodeValue;
                                        if (strcmp($bval, '') == 0)
                                            $bval = 0;
                                        $pitem->innerstyle .= "padding-top:" . $bval . "px;";
                                        $aval = $prchildnode->attributes->getNamedItem('after')->nodeValue;
                                        if (strcmp($aval, '') == 0)
                                            $aval = 0;
                                        $pitem->innerstyle .= "padding-bottom:" . $aval . "px;";
                                    
                                        array_push($pdef->data, $pitem);
                                    }
                                }
                            }
                        
                        
                            if (strcmp($pchildnode, "w:r") == 0)
                            {
                                foreach($pchildnode->childNodes as $rchildnode)
                                {
                                    //process text
                                    if (strcmp($rchildnode->nodeName, "w:t") == 0)
                                    {
                                        $pdef->text .= $rchildnode->nodeValue;
                                        if (count($pdef->data) == 0)
                                        {
                                            $pitem = new Docx_p_item;
                                            $pitem->name = 'styleId';
                                            $pitem->value = '';
                                            array_push($pdef->data, $pitem);
                                        }
                                    }
                                
                                    if (strcmp($rchildnode->nodeName, "w:rPr") == 0)
                                    {
                                        foreach($rchildnode->childNodes as $rPrchildnode)
                                        {
                                            if (strcmp($rPrchildnode->nodeName, "w:b") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textBold';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "text-weight: 500;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:i") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textItalic';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "text-style: italic;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:u") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textUnderline';
                                                $pitem->value = '';
                                                $pitem->innerstyle .= "text-decoration: underline;";
                                                array_push($pdef->data, $pitem);
                                            }
                                            if (strcmp($rPrchildnode->nodeName, "w:sz") == 0 )
                                            {
                                                $pitem = new Docx_p_item;
                                                $pitem->name = 'textSize';
                                            
                                                $sz = $rPrchildnode->attributes->getNamedItem('val')->nodeValue;
                                                if ($sz == '')
                                                {
                                                    $sz=0;
                                                }
                                                $pitem->value = $sz;
                                                array_push($pdef->data, $pitem);
                                            }
                                        }
                                    }
                                }
                            }
                        }
                  
                       array_push($this->paragraphs, $pdef);
                    }
                }
            }
        } 
    
    }

    public function to_html()
    {
        $html = '';
    
        foreach($this->paragraphs as $para)
        {
            $styleselect = null;
            $type = 'text';
            $content = $para->text;
            $sz = 0;
            $extent = '';
            $embedid = '';
        
            $pinnerstylesid = '';
            $pinnerstylesunderline = '';
            $pinnerstylessz = '';         
           
        
            if (count($para->data) > 0)
            {
                foreach($para->data as $node)
                {
                    if (strcmp($node->name, "styleId") == 0)
                    {
                        $type = $node->type;
                        $pinnerstylesid = $node->innerstyle;
                       
                        foreach($this->styles as $style)
                        {
                            if (strcmp ($node->value, $style->styleId) == 0)
                            {
                                $styleselect = $style;
                            }
                        }
                    }
                
                    if (strcmp($node->name, "align") == 0)
                    {
                        $pinnerstylesid .= $node->innerstyle. ";";
                    }
                
                    if (strcmp($node->name, "drawing") == 0)
                    {
                        $type = $node->type;
                        $extent = $node->innerstyle;
                        $embedid = $node->value;
                    }
                
                    if (strcmp($node->name, "textSize") == 0)
                    {
                        $sz = $node->value;
                    }
                
                    if (strcmp($node->name, "textUnderline") == 0)
                    {
                       $pinnerstylesunderline = $node->innerstyle;
                    }
                }
            }
     
           if (strcmp($type, 'text') == 0)
           {
                //echo "has valid para";
                //echo "<br>";
                if ($styleselect != null)
                {
                    //echo "has valid style";
                    //echo "<br>";
                
                    if (strcmp($styleselect->color, '') != 0)
                    {
                       $pinnerstylesid .= "color:#" . $styleselect->color. ";";
                    }
                }
            
                if ($sz != 0)
                {
                    $pinnerstylesid .= 'font-size:' . $sz . 'px;';
                    //echo "sz<br>";
                }
            
                $span =  "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
                $span .= $content;
                $span .= "</p>";
                //echo $span;
                $html .= $span;
            }
        
            if (strcmp($type, 'graphic') == 0)
            {
                $imglnk = '';
            
                foreach($this->rels as $rel)
                {
                    if(strcmp($embedid, '') != 0 && strcmp($rel->Id, $embedid) == 0)
                    {
                        foreach($this->imglnks as $imgpathdef)
                        {
                            if (strpos($imgpathdef->extractedpath, $rel->Target) >= 0)
                            {
                                $imglnk = $imgpathdef->extractedpath;
                                //echo "has img link<br>";
                                //echo $imglnk . "<br>";
                            }
                        }
                    }
                }
            
                if ($styleselect != null)
                {
                    //echo "has valid style";
                    //echo "<br>";
                
                    if (strcmp($styleselect->color, '') != 0)
                    {
                        $pinnerstylesid .= "color:#" . $styleselect->color. ";";
                    }
                }
            
                if ($sz != 0)
                {
                    $pinnerstylesid .= 'font-size:' . $sz . 'px;';
                    //echo "sz<br>";
                }
            
                $span =  "<p style='". $pinnerstylesid . $pinnerstylesunderline ."'>";
                $span .= "<img style='". $extent ."' alt='image coming soon' src ='". $imglnk ."'/>";
                $span .= "</p>";
                //echo $span;
                $html .= $span;
            }
           
        }
        return $html;
    }

    public function get_errors() {
        return $this->errors;
    }

    private function getStyles() {
    
    }

 }

 function getDocX($path)
 {
    //echo $path;
    $doc = new Docx_reader();
    $doc->setFile($path);

    if(!$doc->get_errors()) {
        $doc->processDocument();
        $html = $doc->to_html();
        echo $html;
    }
    return "";
}
?>
0

精彩评论

暂无评论...
验证码 换一张
取 消