So, we have a great applicatio开发者_运维百科n, that is going well, but some of our users like to copy their text to word before pasting into our application. When they do that, the HTML is parsed out somewhat properly, but usually contains tags from outlook or word, that our XHTML engine just doesn't like, or understand.
For example, a user types in a note into Word, has some minor formatting in it, and they past into our HTML editor (it's just a basic webbrowser with designmode turned on), the subsequent source includes <_o3a_p> tags, among others.
Am i going to have to just write a stripper for every type of MSO html tag?
I have had good luck pasting WORD content to Libre Office, and then re-selecting and copying the text out of Libre Office into a web form.
It keeps the formatting, and links, and removes all the Microsoft formatting Code.
As a user that sometimes copies data from Word to a web form (I sometimes like to spellcheck first), I've found great success by first pasting into Notepad, then copying from there and pasting into the web form.
However, Word still sometimes has the last laugh. If you have "smart quotes" enabled, it turns
This is the "best" way.
into
This is the “best” way.
(Note the quotes around the word "best").
The easy way to fix this is to turn off Smart Quotes before I begin to type; I can also use Notepad to find all of the "smart quote" symbols (“ ” ‘ ’) and replace them with "normal quote" symbols (" " ' ').
The consensus seems to be that while some tools available are somewhat successful at auto parsing ms work tags, none are 100% perfect. Methods to parse those tags depend upon what framework you are using.
Regular expression would probably be a clean fix.
Some more information about this topic can be found
on this blog post that basically documents the same struggle you seem to be having.
精彩评论