How to handle nested form tags with lxml

Developer https://www.devze.com 2023-03-18 23:50 Source: web
I want to scrape some HTML pages that have nested form elements using lxml. Even BeautifulSoup chokes on these pages; the only parser I've found so far that can handle them is MinimalSoup, which has no knowledge of which tags can or cannot be nested.

Does lxml have any parsers that don't care about nested form tags? Any other suggestions?

If I have to I'll just continue using MinimalSoup.


How about lxml.etree.HTMLParser? That should work relatively well, right?

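# Python 3: urllib.request replaces the Python 2 urllib2 module
from urllib.request import urlopen
import lxml.etree as etree

# url is the address of the page you want to scrape
page = urlopen(url)
# HTMLParser recovers from malformed markup by default
parser = etree.HTMLParser()
tree = etree.parse(page, parser)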

And you have your tree!
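To check this without fetching a page, here is a minimal sketch that parses an in-memory string instead. The sample markup and the helper name input_names are made up for illustration and are not from the question. How the parser arranges the two form elements in the resulting tree may depend on the underlying libxml2 version, so the sketch only relies on the input fields surviving the parse, which is usually what matters for scraping.

```python
import io
import lxml.etree as etree

# Hypothetical sample markup with one form nested inside another.
SAMPLE_HTML = """
<html><body>
<form id="outer" action="/outer">
  <input name="a">
  <form id="inner" action="/inner">
    <input name="b">
  </form>
  <input name="c">
</form>
</body></html>
"""

def input_names(html_text):
    """Parse HTML leniently and return the name of every <input>, in document order."""
    parser = etree.HTMLParser()  # recovers from invalid markup by default
    tree = etree.parse(io.StringIO(html_text), parser)
    # Query inputs document-wide rather than per form, so the result does not
    # depend on how the parser chose to nest (or un-nest) the form elements.
    return [str(name) for name in tree.xpath("//input/@name")]

names = input_names(SAMPLE_HTML)
print(names)
```

If you need to know which form each field ended up under, walking up with getparent() from each input is one way to inspect what the parser actually built.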
