HTMLParser and weird behavior

Developer https://www.devze.com 2023-02-21 00:30 Source: web

I have to extract a piece of information from the following web page with Python 3: http://www.homefinance.nl/english/international-interest-rates/libor/libor-interest-rates-gbp.asp

The download using urllib.request seems OK, but surprisingly, when I parse the HTML file with my HTMLParser class, the parsing seems to stop in the middle of the meta tags without giving any reason.

This is my code:

import urllib.request
from html.parser import HTMLParser

def downloadLIBOR():
    html_file = urllib.request.urlopen("http://www.homefinance.nl/english/international-interest-rates/libor/libor-interest-rates-gbp.asp")
    return html_file

class tmpHTMLParser(HTMLParser):

    def __init__(self):
        self._libor = "0.81625 %"
        self._stack = []
        self._properStack = []
        super().__init__()

    def handle_starttag(self, tag, attrs):
        print("starttag " + str(tag))
        print(self.get_starttag_text())
        self._stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        print("startendtag")

    def unknown_decl(self, data):
        print("unknown_decl")

    def handle_endtag(self, tag):
        print("endtag " + str(tag))
        self._stack.pop()

def _buildProperStack(webpage):
    """dev tool: parse webpage to trace the tag stack leading to the LIBOR rate."""
    parser = tmpHTMLParser()
    parser.feed(webpage)

if __name__ == "__main__":
    webpage = downloadLIBOR()
    print("download done")
    html = str(webpage.read())
    _buildProperStack(html)
    exit(0)


BTW, I noticed that you forgot to call parser.close() after parser.feed(). The parser might still be buffering some input, and close() will force it to process whatever is left.
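
Here is a minimal sketch of the feed-then-close pattern, in case it helps. It runs on a made-up page rather than the real site, and TagCollector, collect_start_tags and SAMPLE_CHUNKS are illustrative names that are not in your code. Whether the missing close() actually explains the truncated output you see is an assumption I have not verified.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def collect_start_tags(chunks):
    """Feed the chunks one by one, then close() to flush buffered input."""
    parser = TagCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()  # forces processing of anything still buffered
    return parser.start_tags


# A made-up page (not the real one), split in the middle of the
# <meta> tag on purpose so the parser has to buffer between feeds.
SAMPLE_CHUNKS = [
    '<html><head><meta char',
    'set="utf-8"><title>LIBOR</title></head>',
    '<body><p>0.81625 %</p></body></html>',
]

if __name__ == "__main__":
    print(collect_start_tags(SAMPLE_CHUNKS))
```

The parser keeps an incomplete tag in its buffer until more data arrives, so calling close() once at the end is the safe way to make sure nothing is left unprocessed.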


Not sure what you are actually trying to do, but using BeautifulSoup for parsing HTML is much nicer, easier, and less error-prone.
