I have a huge database of scraped forum posts that I am inserting into a website. However, a lot of people try to use HTML in their forum posts and often do it wrong. Because of this, there are always stray <strike> <b> </strike> </div> </b>
tags in the posts, which end up breaking the page layout when I add, say, 15 forum posts.
For now I have just been appending every possible end tag to each post so that it might catch any open tag. Is there a better way to do this, short of parsing through the text and manually removing each open tag? For very long forum posts this is an expensive operation for a web app.
Have a look at HTML Tidy
There is also a Python wrapper library: µTidylib
Alternatively there is HTML Purifier
Beautiful Soup does a decent job at HTML cleanup.
Also look at lxml.
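If installing one of the libraries named in this thread is not an option, the core idea (drop stray end tags, close whatever is left open) can be sketched with Python's standard-library `html.parser`. This is only a rough illustration, not how Tidy, Beautiful Soup, or lxml work internally; `TagBalancer`, `balance_tags`, and `VOID_TAGS` are hypothetical names made up for this example.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list, not exhaustive).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagBalancer(HTMLParser):
    """Re-emit a fragment, dropping stray end tags and closing open ones."""

    def __init__(self):
        # Keep entities such as &amp; untouched instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []    # pieces of the cleaned fragment
        self.stack = []  # names of tags that are currently open

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing form like <br/>: emit as written, nothing to track.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag with no matching start tag: drop it
        # Close everything opened after the matching start tag, then the tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def balance_tags(fragment):
    """Return the fragment with every start tag closed and no stray end tags."""
    parser = TagBalancer()
    parser.feed(fragment)
    parser.close()
    # Close anything the post's author never closed, innermost first.
    while parser.stack:
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


cleaned = balance_tags("<strike><b>oops</strike></div> text")
```

Running each post through something like this once, at insert time, and storing the cleaned result avoids paying the parsing cost on every page view. Note that the sketch only balances tags; it does not strip scripts or other unsafe markup.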