I am looking for algorithms & data structures one would use to fix broken HTML. I know built-in tools exist in most languages to do this, but I want to learn how it works. Some approaches I can think of are:
- Using Regular Expressions - seems like a naive approach
- Create a DOM - but how would a DOM tree get built from broken HTML?
UPDATE: I am expecting more of a general discussion, but references to tools in C, C++, Python or Java are fine by me.
thanks
Parse the markup using the HTML 5 parsing algorithm (which is designed to handle brokenness), and build a DOM from it. You can then serialize back to HTML.
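To give a feel for the data structure involved, here is a minimal sketch of the "stack of open elements" idea, using only Python's standard `html.parser` as a tolerant tokenizer. It is not the HTML5 parsing algorithm itself, which has many more rules (insertion modes, implied tags, and so on); the names `TagBalancer` and `repair_html` are made up for illustration, and the sketch simply drops comments and doctypes.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}
# Elements whose text content must be passed through unescaped.
RAW_TEXT_ELEMENTS = {"script", "style"}


class TagBalancer(HTMLParser):
    """Hypothetical helper: re-emit markup while tracking open elements."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.open_elements = []  # stack of currently open tag names
        self.out = []            # pieces of the repaired output

    def handle_starttag(self, tag, attrs):
        rendered = "".join(
            f" {name}" if value is None
            else f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_ELEMENTS:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_elements:
            return  # stray end tag with no matching start tag: drop it
        # Close everything that was opened after the matching start tag.
        while self.open_elements:
            top = self.open_elements.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        in_raw_text = (self.open_elements
                       and self.open_elements[-1] in RAW_TEXT_ELEMENTS)
        self.out.append(data if in_raw_text else escape(data, quote=False))

    def repair(self, markup):
        self.feed(markup)
        self.close()
        # Anything still open at end of input gets closed, innermost first.
        while self.open_elements:
            self.out.append(f"</{self.open_elements.pop()}>")
        return "".join(self.out)


def repair_html(markup):
    return TagBalancer().repair(markup)


print(repair_html("<div><p>Hello <b>world</p></i>"))
```

For the input above, the unclosed `<b>` is closed before `</p>`, the stray `</i>` is discarded, and the dangling `<div>` is closed at the end. A real HTML5 parser makes more careful choices in each of these cases, but the stack is the core of it.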
RegEx + HTML = disaster.
There are just too many ways for HTML to be valid SGML yet break RegEx rules.
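A small illustration of one such failure, assuming nothing beyond Python's `re` module: a pattern has no notion of nesting depth, so both the lazy and the greedy version of a "match a div" expression pair the wrong tags. The pattern names are only for this example.

```python
import re

# Lazy version: stops at the first </div> it sees.
LAZY_DIV = re.compile(r"<div>(.*?)</div>", re.DOTALL)
# Greedy version: runs on to the last </div> in the input.
GREEDY_DIV = re.compile(r"<div>(.*)</div>", re.DOTALL)

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# The lazy pattern pairs the outer start tag with the inner end tag.
lazy_capture = LAZY_DIV.search(nested).group(1)
print(lazy_capture)    # outer <div>inner

# The greedy pattern swallows two separate elements as one.
greedy_capture = GREEDY_DIV.search(siblings).group(1)
print(greedy_capture)  # a</div><div>b
```

And that is with well-formed input; with missing or misordered end tags there is no pattern to fall back on at all.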
Really you need stateful SGML parsers. You don't mention what languages you're willing to work in, but there are many stateful SGML parsers out there.
In .NET we regularly use SGMLReader - a stateful parser that returns a well-formed DOM and/or XML DOM.
In C, W3C has a reasonable C SGML parser.
In Java there is a SAX-style SGML parser.
I agree that the regular expressions road is long and tortuous: it is much more robust, and easier, to use existing code designed specifically for reading broken HTML.
Since you mention Python, the Beautiful Soup parser reputedly handles broken HTML quite nicely.