How to have the HTMLParser continue parsing after a parse error?

Developer https://www.devze.com 2023-02-27 17:45 Source: web
I am writing a web crawler and I use the HTMLParser module to extract the links from HTML documents. When the parser comes across bad markup, it raises a parse error that terminates the application. Since the crawler traverses the whole web, this error gets raised quite often.

On the python.org bug tracker, someone has already raised this issue. You can look at that here. The problem is that I don't really know how to use the "patch" that is provided, and I don't understand the comments.

I want to override the default behavior of the HTMLParser module so that it continues parsing after a parse error.


You should use BeautifulSoup instead of HTMLParser. BeautifulSoup is much more robust.

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.
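As a rough illustration of that suggestion, here is a minimal sketch of pulling links out of deliberately broken markup. It assumes the `bs4` package (Beautiful Soup 4) is installed and uses its built-in "html.parser" backend; the sample markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: an unclosed <p> and <b>, and an <a> with no href.
broken_html = '<a href="/a">A</a><p>unclosed <b>bold<a href="/b">B</a><a>no target</a>'

# "html.parser" is the pure-Python backend that ships with the standard library.
soup = BeautifulSoup(broken_html, "html.parser")

# href=True keeps only the <a> tags that actually carry an href attribute.
links = [tag["href"] for tag in soup.find_all("a", href=True)]
print(links)
```

Other backends such as lxml or html5lib may repair the tree differently, but for link collection the result is usually the same.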


I don't use HTMLParser myself, but can you not just place your statement in a try/except block?

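import HTMLParser  # Python 2 module name

parser = HTMLParser.HTMLParser()
try:
  parser.feed(page_source)  # page_source: the HTML of one page, as a string
except HTMLParser.HTMLParseError:
  pass  # bad markup: skip the rest of this page and move on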
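For a crawler, the same idea can be wrapped around each page so that one bad document does not stop the whole run. The sketch below is written for Python 3, where the module is `html.parser` and `HTMLParseError` no longer exists (it was removed in 3.5), so it catches `Exception` broadly as an assumption. `LinkParser` and `extract_links` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(page_source):
    """Return the links found in one page, or None if parsing failed."""
    parser = LinkParser()
    try:
        parser.feed(page_source)
        parser.close()
    except Exception:  # assumption: any parser failure means "skip this page"
        return None
    return parser.links


# Two sample pages; the second one has tags that are never closed.
pages = ['<a href="/one">one</a>', '<a href="/two">two<p><b>never closed']
results = [extract_links(page) for page in pages]
print(results)
```

Returning None mirrors the `myval = None` fallback above; a crawler could instead keep `parser.links` as collected up to the point of failure.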


Some blogs and sites do not want their pages to be scanned and parsed by bots and parser programs, so they deliberately include markup that makes many parsers fail.

Often the page's code contains something like

document.write('<scri'+'pt'...)

This is how pages insert code via JavaScript, but when the whole page is fed to the parser, the parser reports that a "bad tag was encountered" and stops.

If you only need the content, the best way to solve this problem is to remove all the JavaScript code before parsing, and you'll be working just fine :)
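One way to do that removal is a regular expression that drops every script block before the page reaches the parser. This is only a sketch: `strip_scripts` is a hypothetical helper, the sample page is invented, and the pattern is an approximation that leaves inline event handlers such as `onclick` untouched.

```python
import re

# Matches an opening <script ...> tag, its body, and the closing </script>.
# DOTALL lets "." span newlines; IGNORECASE also covers <SCRIPT>.
SCRIPT_BLOCK = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)


def strip_scripts(page_source):
    """Return page_source with all <script>...</script> blocks removed."""
    return SCRIPT_BLOCK.sub("", page_source)


page = """<p>before</p><script type="text/javascript">document.write('<scr'+'ipt src="x.js"></scr'+'ipt>');</script><a href="/next">next</a>"""
cleaned = strip_scripts(page)
print(cleaned)
```

Because the split string never spells out a literal closing script tag, the lazy match ends at the real one, and the parser only sees the surrounding content.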
