
Convert from Word document to HTML

Developer https://www.devze.com 2022-12-20 09:21 Source: Internet

I want to save a Word document as HTML using Word Viewer, without having Word installed on my machine. Is there any way to accomplish this in C#?


For converting .docx file to HTML format, you can use OpenXmlPowerTools. Make sure to add a reference to OpenXmlPowerTools.dll.

using System.IO;
using System.Xml.Linq;
using DocumentFormat.OpenXml.Packaging; // WordprocessingDocument lives here
using OpenXmlPowerTools;

byte[] byteArray = File.ReadAllBytes(DocxFilePath);
using (MemoryStream memoryStream = new MemoryStream())
{
     memoryStream.Write(byteArray, 0, byteArray.Length);
     using (WordprocessingDocument doc = WordprocessingDocument.Open(memoryStream, true))
     {
          HtmlConverterSettings settings = new HtmlConverterSettings()
          {
               PageTitle = "My Page Title"
          };
          XElement html = HtmlConverter.ConvertToHtml(doc, settings);

          File.WriteAllText(HTMLFilePath, html.ToStringNewLineOnAttributes());
     }
}


You can try Microsoft.Office.Interop.Word. Note that this approach requires Word to be installed on the machine.

   using Word = Microsoft.Office.Interop.Word;

    public static void ConvertDocToHtml(object Sourcepath, object TargetPath)
    {

        Word._Application newApp = new Word.Application();
        Word.Documents d = newApp.Documents;
        object Unknown = Type.Missing;
        Word.Document od = d.Open(ref Sourcepath, ref Unknown,
                                 ref Unknown, ref Unknown, ref Unknown,
                                 ref Unknown, ref Unknown, ref Unknown,
                                 ref Unknown, ref Unknown, ref Unknown,
                                 ref Unknown, ref Unknown, ref Unknown, ref Unknown);
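        object format = Word.WdSaveFormat.wdFormatHTML;

        // Save the document that was just opened, rather than relying on ActiveDocument.
        od.SaveAs(ref TargetPath, ref format,
                    ref Unknown, ref Unknown, ref Unknown,
                    ref Unknown, ref Unknown, ref Unknown,
                    ref Unknown, ref Unknown, ref Unknown,
                    ref Unknown, ref Unknown, ref Unknown,
                    ref Unknown, ref Unknown);

        // Close without saving again, then quit Word so no WINWORD.EXE process is left running.
        object doNotSave = Word.WdSaveOptions.wdDoNotSaveChanges;
        d.Close(ref doNotSave, ref Unknown, ref Unknown);
        newApp.Quit(ref Unknown, ref Unknown, ref Unknown);
    }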


I wrote Mammoth for .NET, which is a library that converts docx files to HTML, and is available on NuGet.

Mammoth tries to produce clean HTML by looking at semantic information -- for instance, mapping paragraph styles in Word (such as Heading 1) to appropriate tags and style in HTML/CSS (such as <h1>). If you want something that produces an exact visual copy, then Mammoth probably isn't for you. If you have something that's already well-structured and want to convert that to tidy HTML, Mammoth might do the trick.
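
As a rough sketch of what basic usage looks like, assuming the Mammoth NuGet package is referenced: the file name "document.docx" and the custom style name "Section Title" are placeholders, and the style map line is only an illustration of overriding the default mapping.

```csharp
using System;

public static class MammothExample
{
    public static string ConvertDocxToHtml(string docxPath)
    {
        // Map a custom Word paragraph style to <h1>; built-in styles such as
        // Heading 1 are handled by Mammoth's default style map.
        var converter = new Mammoth.DocumentConverter()
            .AddStyleMap("p[style-name='Section Title'] => h1:fresh");

        var result = converter.ConvertToHtml(docxPath);

        // Warnings report things Mammoth could not map, e.g. unrecognised styles.
        foreach (var warning in result.Warnings)
        {
            Console.Error.WriteLine(warning);
        }

        // The generated HTML is a fragment, not a full page with <html> and <body>.
        return result.Value;
    }
}
```

Calling `MammothExample.ConvertDocxToHtml("document.docx")` returns the HTML as a string, which you can then wrap in your own page template.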


I think this will depend on the version of the Word document. If you have them in .docx format, the content is stored as XML parts inside a ZIP package (it is a long time since I looked at the specification, so I am happy to be corrected on the details). Older .doc files use a binary format instead.
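
To illustrate the point about .docx storing its content as XML, here is a minimal sketch that uses only the .NET base class library. It assumes the main document part sits at the conventional path word/document.xml; strictly speaking the location is defined by the package relationships, so treat this as an inspection aid rather than a robust parser. The class and method names are made up for the example.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

public static class DocxPeek
{
    // Reads the main document part from a .docx stream and returns it as XML.
    public static XDocument ReadMainDocument(Stream docxStream)
    {
        // A .docx file is a ZIP archive, so ZipArchive can open it directly.
        using (var archive = new ZipArchive(docxStream, ZipArchiveMode.Read, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.GetEntry("word/document.xml");
            if (entry == null)
            {
                throw new InvalidDataException(
                    "word/document.xml not found; is this a .docx file?");
            }

            using (Stream part = entry.Open())
            {
                return XDocument.Load(part);
            }
        }
    }
}
```

Opening a real file with `File.OpenRead("example.docx")` and passing the stream to `DocxPeek.ReadMainDocument` gives you an `XDocument` whose root is the `w:document` element. A binary .doc file is not a ZIP archive, so `ZipArchive` will throw on it.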


According to this Stack Overflow question, it isn't possible with Word Viewer. You will need Word itself installed in order to drive it through COM Interop.


If you're open to not using C#, you could print to file using PrimoPDF (which would turn the .doc into a .pdf) and then use a PDF-to-HTML converter to go the rest of the way. After that you can edit the HTML however you like.


I asked a similar question, "Convert Word to HTML then render HTML on webpage", which you might find helpful if you are still working on this. There's a freely distributed DLL for this; I have given the link there.


GemBox works pretty well. It even converts images in the Word document to base64-encoded strings in img tags.


You will need to have MS Word installed to do this, I believe.

Check out this article for details on the implementation.


Using the document conversion tools available in OpenOffice.org is probably the only possible option. The .doc format was only designed to be opened by Microsoft products, so any library dealing with it has had to reverse engineer the entire format.
