Complex HTML parsing with Python

Developer (开发者) https://www.devze.com 2023-01-06 07:51 Source: web

I am already aware of tag based HTML parsing in Python using BeautifulSoup, htmllib etc.

However, I want a powerful engine that can do complex tasks such as reading HTML tables, lists, etc. and presenting them as simple-to-use objects within code. Does Python have such powerful libraries?


BeautifulSoup is a nice library: it parses HTML and gives you some handy ways to navigate the result and pull out the data.

What you are trying to do can easily be done using some simple regular expressions. You can write regular expressions that search for a particular pattern of data and extract what you need.
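As a rough sketch of that approach, the snippet below pulls the text of each list item out of a small, regular piece of markup using only the standard `re` module. The sample HTML and the name `ITEM_RE` are made up for illustration, and the pattern assumes items are not nested and contain no inner tags.

```python
import re

# Hypothetical sample markup; real pages are rarely this regular.
html = """
<ul>
  <li>Apples</li>
  <li class="fruit">Pears</li>
</ul>
"""

# Non-greedy match of whatever sits between <li ...> and </li>.
# re.DOTALL lets an item span several lines; re.IGNORECASE accepts <LI>.
ITEM_RE = re.compile(r"<li\b[^>]*>(.*?)</li>", re.DOTALL | re.IGNORECASE)

items = [text.strip() for text in ITEM_RE.findall(html)]
print(items)  # ['Apples', 'Pears']
```

A pattern like this only holds up while the markup keeps the same shape, so it suits one-off extraction from a page you control more than general-purpose parsing.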


You might consider lxml, which has a powerful HTML processor. A complementary module built on lxml, pyquery, might be just what you're looking for.

PyQuery has jQuery-like syntax, so if you're used to jQuery you'll be able to jump right in.

Here is a simple example to get the first <ul> item from aol.com:

>>> from pyquery import PyQuery as pq
>>> from urllib.request import urlopen
>>> data = urlopen('http://aol.com').read()
>>> d = pq(data)
>>> first_ul = d('ul:first')
>>> first_ul
[<ul#dhL2>]
>>> print(first_ul)
<ul id="dhL2"><li class="dhL1"><a accesskey="" href="https://new.aol.com/productsweb/?promocode=827693&amp;ncid=txtlnkuswebr00000074" name="om_dirbtn1" class="_o4-0" id="om_dirbtn1">Get Free Mail</a></li>
            </ul>


The standard HTML parsers are already pretty good at giving you simple objects (e.g. iterables). Creating anything more complex than a 2D list from a table would likely be dependent on the data that was in the page.

With that said...

Here's a link to a blog post by someone who wrote a script to convert HTML tables to Python lists. The actual file is located here.
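The linked script is not reproduced here. As a minimal sketch of the same idea, the code below uses the standard library's `html.parser` to turn each table into a 2D list of cell text. The names `TableParser` and `tables_from_html` and the sample markup are illustrative, and the sketch assumes tables are not nested and ignores `rowspan`/`colspan`.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect each <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one 2D list per <table> seen so far
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)


def tables_from_html(html):
    """Return every table in `html` as a list of rows of cell text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hypothetical sample data for illustration.
sample = """
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Apples</td><td>3</td></tr>
  <tr><td>Pears</td><td>5</td></tr>
</table>
"""

rows = tables_from_html(sample)[0]
print(rows)  # [['Name', 'Qty'], ['Apples', '3'], ['Pears', '5']]
```

Every cell comes back as a string. Converting `'3'` to a number, or treating the first row as headers, depends on the data in the page, so it is left to the caller.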

I've never heard of a standard Python library that does these sorts of operations, so your best bet might be Googling each case as you need it. Chances are someone has already done what you are trying to do.

Disclaimer: You should always read and understand any code you find online before pasting it into your own applications! Citing who/where it's from is good too!
