Programmatically clean Word-generated HTML while preserving styles?

At my current company, we have this decade-old... let's call it a "Hello World" application.

While wanting to create a newer version of it, we also want to preserve older entries. These older entries contain hideous Word-generated HTML which was never filtered before.

If and when we move to a newer system, I'd prefer to have that HTML cleaned and filtered in order to have the site comply with HTML standards as much as possible.

However, just cleaning that code like Jeff Atwood described in his blog or in any other way I know of would also ruin the style and formatting.

Now, that just might cause our users to revolt and then all hell will break loose - not a very good idea.

So the question is: Can Word's HTML be cleaned while preserving basic formatting (e.g., colored, italicized, and bold text)?

Preferably using publicly available code or a library such as HTML Tidy. Examples in C# would be much appreciated.

There are a couple of options available, but you can certainly use Jeff Atwood's code as a good starting point for writing your own. If you do, you'll get fine-grained control over the result. Note, though, that the results will never be 100% accurate, since all that extra MS-specific markup is there to ensure as much fidelity with the original document as possible (at least in IE, for round-tripping purposes). Still, most code out there does preserve most formatting.

Here are some code libraries that could be helpful:

Microsoft Word 2000 HTML Mess Cleaner (note: this one sells the source code)
MS Word HTML Cleanup Tool (note: intended to work with FCKEditor, but source is available)

If you just want batch processing (and don't care about owning a code base), the Office 2000 HTML Filter 2.0 is probably your best bet. Read more about it on TechRepublic.

tidy works fine for cleaning up and regularizing HTML syntax.

It's very configurable, so for a batch cleanup, it's likely the command line tool will do what you need. You don't have to program tidylib yourself.
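As a rough illustration of the batch approach, here is a minimal Python sketch that shells out to the command-line tool for every file in a folder. It assumes a `tidy` executable is on the PATH. The option names (`word-2000`, `bare`, `drop-proprietary-attributes`, `quiet`) are taken from HTML Tidy's configuration options, but check them against your installed version. The function names and the folder layout are hypothetical.

```python
import subprocess
from pathlib import Path

# Tidy configuration options aimed at Word output (verify against your tidy version).
TIDY_OPTIONS = {
    "word-2000": "yes",                    # strip the surplus markup Word 2000 adds
    "bare": "yes",                         # strip Microsoft-specific HTML
    "drop-proprietary-attributes": "yes",  # drop non-standard attributes
    "quiet": "yes",                        # suppress non-essential output
}


def build_tidy_command(src, dst, options=TIDY_OPTIONS):
    """Return the argument list for one tidy run that reads src and writes dst."""
    cmd = ["tidy"]
    for name, value in options.items():
        cmd += [f"--{name}", value]
    cmd += ["-o", str(dst), str(src)]
    return cmd


def clean_directory(src_dir, dst_dir):
    """Run tidy over every *.html file in src_dir; return the files it rejected."""
    dst_root = Path(dst_dir)
    dst_root.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(src_dir).glob("*.html")):
        result = subprocess.run(build_tidy_command(src, dst_root / src.name))
        # tidy exits with 0 for success, 1 for warnings and 2 for errors.
        if result.returncode >= 2:
            failed.append(src)
    return failed
```

Calling `clean_directory("old_entries", "cleaned")` would write a cleaned copy of each file and return the ones tidy could not process, so the originals stay untouched for comparison.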

If you need more involved cleanup of the content, not just the syntax, some XSLT processors (xsltproc, for one) have an '--html' option: input files are parsed by the HTML parser instead of an XML parser. You can then use XSLT to transform or rearrange the content, and output it with the HTML serializer.

This SO question poses a similar problem, although there, programmatic cleanup is not required.

One of the answers mentions that Office 2007 has a Publish->Blog menu item that reportedly produces good results and is fast. You could create a Word macro that invokes this command, and then invoke the macro programmatically. You can use COM or VBScript to start Word and run the macro, or run winword.exe with the /m switch. Command line switches to winword.exe are given here.

Do you have a budget for it? This might work. Try before you buy.

Take a look at FCKEditor. It's a JavaScript-based editor, so looking at the source might give you lots of hints as to what to look for when removing Word HTML.

In particular, take a look at the file /editor/dialog/fck_paste.html. There's a function, "CleanWord", that does it all. I've modified it for use in my own applications (slight modifications, i.e. different replacements, etc.), and it does a great job of getting rid of ugly Word HTML.

It works by using regular expressions to find and replace, which means you can easily extract the regexes and port them to another programming language of your choice to run the batch job.
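To give an idea of what such a find-and-replace pass looks like, here is a minimal Python sketch. It is not a port of FCKEditor's CleanWord function: the patterns, the `KEEP_PROPS` set, and the function names are assumptions chosen for illustration. The idea is to strip Word-specific comments, namespaced tags, and Mso classes, and to keep only the style properties that carry basic formatting.

```python
import re

# CSS properties treated as "basic formatting" worth keeping (an assumption).
KEEP_PROPS = {"color", "background-color", "font-weight", "font-style", "text-decoration"}


def _filter_style(match):
    """Rewrite one style attribute, keeping only the declarations in KEEP_PROPS."""
    kept = []
    for declaration in match.group(2).split(";"):
        name, sep, value = declaration.partition(":")
        name = name.strip().lower()
        if sep and name in KEEP_PROPS:
            kept.append(name + ":" + value.strip().replace('"', "'"))
    if not kept:
        return ""  # nothing worth keeping: drop the whole attribute
    return ' style="' + ";".join(kept) + '"'


def clean_word_html(html):
    # Comments, including conditional blocks like <!--[if gte mso 9]>...<![endif]-->.
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # Bare conditionals such as <![if !supportLists]> and <![endif]>.
    html = re.sub(r"<!\[(?:if|endif)[^>]*>", "", html)
    # Office namespace tags such as <o:p> or <w:WordDocument>.
    html = re.sub(r"</?\w+:[^>]*>", "", html)
    # class=MsoNormal and similar, quoted or unquoted.
    html = re.sub(r"""\s+class=(?:"Mso[^"]*"|'Mso[^']*'|Mso\w+)""", "", html, flags=re.I)
    # lang attributes, which Word adds to nearly every span.
    html = re.sub(r"""\s+lang=(?:"[^"]*"|'[^']*'|[\w-]+)""", "", html, flags=re.I)
    # Quoted style attributes: keep the whitelisted properties, drop mso-* and the rest.
    html = re.sub(r"""\s+style=(["'])(.*?)\1""", _filter_style, html, flags=re.I | re.S)
    return html
```

For instance, `clean_word_html("<p class=MsoNormal style='margin:0in;color:red'>Hi<o:p></o:p></p>")` gives `<p style="color:red">Hi</p>`. Like any regex-based cleanup, this is best-effort: it does not understand nesting, so check the results on a sample of real entries before running it over everything.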

PSPad includes tidy, which has a "Clean Microsoft Word 2000" option. I've used it on Word documents before, and it's customizable.

The HtmlRuleSanitizer (available on NuGet) can do this for you out of the box.

It uses the HTML Agility Pack to parse the HTML code and applies a set of whitelist-based rules to preserve formatting. The default rule sets will get rid of virtually all the verbose MS Word HTML code while preserving basic document structure like header tags, bold, italic, etc.

If you want to preserve specific MS Word styling you'll have to create or adapt a rule set for your use case.

It will for example easily convert the hundreds of lines of HTML code which MS Word would generate for a document containing the following:

Heading one

Paragraph

Heading two

Bold

Italic

A Link

To only the following set of relatively clean HTML:

<html>
<body>
<h1><span>Heading</span> <span>one</span></h1>
<p><span>Paragraph</span></p>
<h2><span>Heading</span> <span>two</span></h2>
<p><span><strong>Bold</strong></span><strong></strong></p>
<p><span><i>Italic</i></span><i></i></p>
<p><i><a href="http://www.google.com/" target="_blank" rel="nofollow">Link</a></i></p>
</body>
</html>

Note that some of the annoying things MS Word does, like opening and closing tags very often (see the span elements in the example), are not fully cleaned out.
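The whitelist idea itself is easy to sketch. The following Python example is not HtmlRuleSanitizer's API (that is a .NET library); it is a rough stand-in built on the standard library's `html.parser`, with a hypothetical rule set in `ALLOWED`. Tags that are not listed are unwrapped but their text is kept, so leftover span elements disappear, and a final pass removes empty inline elements such as `<strong></strong>`.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical rule set: allowed tag -> attributes kept on that tag.
ALLOWED = {
    "h1": (), "h2": (), "h3": (), "p": (), "br": (),
    "strong": (), "b": (), "em": (), "i": (), "u": (),
    "ul": (), "ol": (), "li": (),
    "a": ("href",),
}
VOID = {"br"}                              # tags that never get a closing tag
DROP_CONTENT = {"style", "script", "title"}  # tags removed together with their content
EMPTY_INLINE = re.compile(r"<(strong|b|em|i|u)></\1>")


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # greater than 0 while inside a DROP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif self.skip_depth == 0 and tag in ALLOWED:
            kept = "".join(
                " " + name + '="' + escape(value or "") + '"'
                for name, value in attrs
                if name in ALLOWED[tag]
            )
            self.out.append("<" + tag + kept + ">")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0 and tag in ALLOWED and tag not in VOID:
            self.out.append("</" + tag + ">")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(escape(data, quote=False))


def sanitize(html):
    """Keep only whitelisted tags and attributes, then drop empty inline elements."""
    parser = WhitelistSanitizer()
    parser.feed(html)
    parser.close()
    result = "".join(parser.out)
    previous = None
    while previous != result:  # repeat so nested empties like <b><i></i></b> also go
        previous = result
        result = EMPTY_INLINE.sub("", result)
    return result
```

Run on lines from the sample output above, `sanitize('<h1><span>Heading</span> <span>one</span></h1>')` gives `<h1>Heading one</h1>`, and `sanitize('<p><span><strong>Bold</strong></span><strong></strong></p>')` gives `<p><strong>Bold</strong></p>`. Preserving color would need an extra rule that allows a filtered `style` attribute on some tag, which this sketch leaves out.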

Here is a set of PowerShell scripts that will clean Word-Filtered HTML and correctly tag super/subscripts about 95% of the time. (No, you can't get better than that; Word is made for print.)

https://github.com/suzumakes/replaceit

Basic formatting is kept intact, tags become tags and tags become tags. I think this is what you're looking for. Even though you shouldn't use regex to parse HTML, and Word-Filtered HTML is hardly filtered, it is clean after these PowerShell scripts are run on it.

Instructions are there in the ReadMe and if you happen to encounter any additional characters that need to be caught or come up with any tweaks/improvements, I'd be happy to see your pull request.