Reading DOC file in PHP

I'm trying to read .doc and .docx files in PHP. Everything works, but the last line of the output contains garbage characters. Please help. Here is the code, which was written by someone else.

function parseWord($userDoc)
{
    $fileHandle = fopen($userDoc, "r");
    $line = @fread($fileHandle, filesize($userDoc));
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline)
    {
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0))
        {
        } else {
            $outtext .= $thisline . " ";
        }
    }
    $outtext = preg_replace("/[^a-zA-Z0-9\s\,\.\-\n\r\t@\/\_\(\)]/", "", $outtext);
    return $outtext;
}

$userDoc = "k.doc";

Here is a screenshot.

You can read .docx files in PHP this way, but not .doc files. Here is the code to read .docx files:

function read_file_docx($filename) {

    $stripped_content = '';
    $content = '';

    if (!$filename || !file_exists($filename)) return false;

    // Note: the procedural zip_* functions are deprecated as of PHP 8.0
    $zip = zip_open($filename);

    // zip_open() returns an integer error code on failure
    if (!$zip || is_numeric($zip)) return false;

    while ($zip_entry = zip_read($zip)) {

        // Check the name first so that only the entry we open gets closed again
        if (zip_entry_name($zip_entry) != "word/document.xml") continue;

        if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

        $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

        zip_entry_close($zip_entry);
    }// end while

    zip_close($zip);

    //echo $content;
    //echo "<hr>";
    //file_put_contents('1.xml', $content);

    // Turn table-cell and paragraph boundaries into whitespace before stripping tags
    $content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
    $content = str_replace('</w:r></w:p>', "\r\n", $content);
    $stripped_content = strip_tags($content);

    return $stripped_content;
}
$filename = "filepath";// or /var/www/html/file.docx

$content = read_file_docx($filename);
if($content !== false) {

    echo nl2br($content);
}
else {
    echo 'Couldn\'t read the file. Please check that file.';
}
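
Since the zip_* functions used above are deprecated in PHP 8, here is a rough sketch of the same idea with the ZipArchive class instead. It assumes the zip extension is loaded; read_docx_text() is just a name I made up, and the tag handling is copied from read_file_docx() above, so it has the same limitations.

```php
<?php
// Sketch: extract plain text from a .docx using ZipArchive (ext-zip).
// read_docx_text() is a hypothetical name, not part of any library.
function read_docx_text(string $filename)
{
    if (!is_file($filename)) {
        return false;
    }

    $zip = new ZipArchive();
    // open() returns true on success, or an integer error code on failure
    if ($zip->open($filename) !== true) {
        return false;
    }

    // getFromName() returns false when the entry does not exist
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($xml === false) {
        return false;
    }

    // Same tag handling as read_file_docx(): cell and paragraph ends become whitespace
    $xml = str_replace('</w:r></w:p></w:tc><w:tc>', ' ', $xml);
    $xml = str_replace('</w:r></w:p>', "\r\n", $xml);

    return strip_tags($xml);
}
```

It is called the same way as read_file_docx() and also returns false when the file is missing or is not a valid archive.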

DOC files are not plain text.

Try a library such as PHPWord (old CodePlex site).

NB: This answer has been updated multiple times as PHPWord has changed hosting and functionality.

I am using this function and it works well for me :) Try it:

function read_doc_file($filename) {
    if (file_exists($filename))
    {
        if (($fh = fopen($filename, 'rb')) !== false)
        {
            $headers = fread($fh, 0xA00);

            // 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n1 = ( ord($headers[0x21C]) - 1 );

            // 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );

            // 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );

            // 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );

            // Total length of text in the document
            $textLength = ($n1 + $n2 + $n3 + $n4);

            $extracted_plaintext = fread($fh, $textLength);
            fclose($fh);

            // simple print character stream without new lines
            //echo $extracted_plaintext;

            // if you want to see your paragraphs in a new line, do this
            return nl2br($extracted_plaintext);
            // need more spacing after each paragraph use another nl2br
        }
    }
    // File is missing or could not be opened
    return false;
}
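
In case the $n1..$n4 arithmetic looks opaque: added together, the four terms are just the 4 bytes at offset 0x21C read as a little-endian 32-bit integer, minus 0x801 (the -1 on the first byte plus the -8*256 on the second). Here is a small sketch that checks this on a made-up header. doc_text_length() is a hypothetical helper name, and I am not claiming anything about what that field means in the .doc format, only that the arithmetic is the same.

```php
<?php
// Sketch: the n1..n4 sum equals a 32-bit little-endian read minus 0x801.
// doc_text_length() is a hypothetical helper name.
function doc_text_length(string $headers): int
{
    // 'V' = unsigned 32-bit integer, little-endian byte order
    $value = unpack('V', substr($headers, 0x21C, 4))[1];

    // -1 from the first byte and -8*256 from the second: 1 + 2048 = 0x801
    return $value - 0x801;
}

// Compare against the byte-by-byte arithmetic on a synthetic header
$headers = str_repeat("\0", 0xA00);
$headers[0x21C] = chr(0x35);
$headers[0x21D] = chr(0x0C);

$n1 = ord($headers[0x21C]) - 1;
$n2 = (ord($headers[0x21D]) - 8) * 256;
$n3 = ord($headers[0x21E]) * 256 * 256;
$n4 = ord($headers[0x21F]) * 256 * 256 * 256;

var_dump(doc_text_length($headers) === $n1 + $n2 + $n3 + $n4); // bool(true)
```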

Decoding in pure PHP never worked for me, so here is my solution: http://wvware.sourceforge.net/

Install the package:

sudo apt-get install wv elinks

Use it in PHP:

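$output = str_replace('.doc', '.txt', $filename);
// Escape both paths so spaces or shell metacharacters in the filename are safe
shell_exec('/usr/bin/wvText ' . escapeshellarg($filename) . ' ' . escapeshellarg($output));
$text = file_get_contents($output);
// Convert to UTF-8 if needed (assumes ISO-8859-1 input, as the deprecated utf8_encode() did)
if (!mb_detect_encoding($text, 'UTF-8', true))
{
    $text = mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
unlink($output);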

I also used it, but for accented characters (and single quotes like ') it would output � instead, so my PDO MySQL insert didn't like it. I finally figured it out by adding:

$extracted_plaintext = mb_convert_encoding($extracted_plaintext,'UTF-8');

So the final version should read:

function getRawWordText($filename) {
    if(file_exists($filename)) {
        if(($fh = fopen($filename, 'r')) !== false ) {
            $headers = fread($fh, 0xA00);
            $n1 = ( ord($headers[0x21C]) - 1 );// 1 = (ord(n)*1) ; Document has from 0 to 255 characters
            $n2 = ( ( ord($headers[0x21D]) - 8 ) * 256 );// 1 = ((ord(n)-8)*256) ; Document has from 256 to 63743 characters
            $n3 = ( ( ord($headers[0x21E]) * 256 ) * 256 );// 1 = ((ord(n)*256)*256) ; Document has from 63744 to 16775423 characters
            $n4 = ( ( ( ord($headers[0x21F]) * 256 ) * 256 ) * 256 );// 1 = (((ord(n)*256)*256)*256) ; Document has from 16775424 to 4294965504 characters
            $textLength = ($n1 + $n2 + $n3 + $n4);// Total length of text in the document
            $extracted_plaintext = fread($fh, $textLength);
            $extracted_plaintext = mb_convert_encoding($extracted_plaintext,'UTF-8');
             // if you want to see your paragraphs in a new line, do this
             // return nl2br($extracted_plaintext);
             return ($extracted_plaintext);
        } else {
            return false;
        }
    } else {
        return false;
    }  
}

This works fine for reading Word .doc files into a MySQL database that uses utf8_general_ci :)

Hope this helps someone else.
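
One caveat on the mb_convert_encoding() call: when the third argument is left out, PHP takes its own internal encoding as the source, so what you get depends on your configuration. If you know (or can guess) the source encoding, passing it explicitly is more predictable. A minimal sketch, assuming the bytes are Windows-1252, which is only my guess for typical Western-language Word files; to_utf8() is a made-up helper name and it needs the mbstring extension.

```php
<?php
// Sketch: convert extracted bytes to UTF-8 with an explicit source encoding.
// Windows-1252 is an assumption; the real encoding depends on the document.
function to_utf8(string $bytes, string $from = 'Windows-1252'): string
{
    // Leave the input alone if it is already valid UTF-8
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $from);
}

// 0xE9 is "é" in Windows-1252 and is not valid UTF-8 on its own
echo to_utf8("caf\xE9"), "\n";
```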

I'm using soffice to convert the .doc file to .txt and then reading the converted .txt file.
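
For anyone who wants to try the soffice route, this is roughly what it could look like. It is a sketch under assumptions: soffice (LibreOffice) is installed and on the PATH, the web server user is allowed to run it, and both function names are made up for illustration.

```php
<?php
// Sketch: convert a .doc to .txt with LibreOffice in headless mode, then read it.
// Function names are hypothetical; "soffice" is assumed to be on the PATH.
function build_soffice_command(string $docPath, string $outDir): string
{
    // Escape both paths so spaces or shell metacharacters are safe
    return 'soffice --headless --convert-to txt:Text --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

function read_doc_via_soffice(string $docPath)
{
    $outDir = sys_get_temp_dir();
    exec(build_soffice_command($docPath, $outDir), $lines, $exitCode);

    // soffice writes <basename>.txt into the output directory
    $txtPath = $outDir . DIRECTORY_SEPARATOR
        . pathinfo($docPath, PATHINFO_FILENAME) . '.txt';
    if ($exitCode !== 0 || !is_file($txtPath)) {
        return false;
    }

    $text = file_get_contents($txtPath);
    unlink($txtPath); // remove the temporary converted file
    return $text;
}
```

Calling read_doc_via_soffice('k.doc') would return the text, or false if the conversion failed. Starting soffice is slow compared with parsing in PHP, so this fits batch jobs better than per-request use.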