
Removing broken tags and poorly formatted HTML from some text

Developer https://www.devze.com 2023-01-12 01:19 Source: the web

I have a huge database of scraped forum posts that I am inserting into a website. However, a lot of people try to use HTML in their forum posts and often do it wrong. Because of this, there are always stray <strike> <b> </strike> </div> </b> tags in the posts, which end up breaking the page layout when I add, say, 15 forum posts.

For now I have just been appending all possible end tags to each post in the hope of closing any open tag. Is there a better way to do this, short of parsing through the text and manually removing each open tag? For very long forum posts this is an expensive operation for a web app.


Have a look at HTML Tidy

There is also a Python wrapper lib: µTidylib

Alternatively there is HTML Purifier


Beautiful Soup does a decent job at HTML cleanup.
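
As a rough illustration of that suggestion, here is a minimal sketch that re-serializes each post through Beautiful Soup. It assumes the `beautifulsoup4` package is installed and uses the standard-library `html.parser` backend; the helper name `clean_fragment` is made up for this example. The parser closes tags left open and ignores closing tags that have no matching opener, so each post comes out balanced.

```python
from bs4 import BeautifulSoup


def clean_fragment(html: str) -> str:
    """Re-serialize an HTML fragment so that its tags are balanced.

    clean_fragment is a hypothetical helper name, not part of Beautiful Soup.
    """
    # "html.parser" is the stdlib backend; it leaves the fragment as-is
    # instead of wrapping it in <html>/<body> elements.
    soup = BeautifulSoup(html, "html.parser")
    return str(soup)


if __name__ == "__main__":
    # A post with a misnested <i> and a post with a stray </div>.
    print(clean_fragment("<b>bold <i>both</b> text"))
    print(clean_fragment("hello </div> world"))
```

Cleaning each post once at insert time, rather than on every page render, should keep the cost off the web request path.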


Look at lxml also.
