开发者

BeautifulSoup is choking on jQuery script, any known workaround?

开发者 https://www.devze.com 2023-01-24 20:22 出处:网络
I\'m giving BeautifulSoup an html document and simply by constructing a BeautifulSoup object instance with the full html, it seems to choke on the following line of a jQuery script that\'s embedded wi

I'm giving BeautifulSoup an html document and simply by constructing a BeautifulSoup object instance with the full html, it seems to choke on the following line of a jQuery script that's embedded within the html:

        var txt = "Logged in as: <a href=\"http://somedomain.com/the-blah/\">" + uname + "</a> <small>(<a href=\"http://somedomain.com/the-blah/\">The Blah</a> | <a href=\"http://somedomain.com/the-blah/?action=logout\">logout</a>)</small>";

The full stack trace for the error is the following:

    /usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, *args, **kwargs)
   1497             kwargs['smartQuotesTo'] = self.HTML_ENTITIES
   1498         kwargs['isHTML'] = True
-> 1499         BeautifulStoneSoup.__init__(self, *args, **kwargs)
   1500 
   1501     SELF_CLOSING_TAGS = buildTagMap(None,

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in __init__(self, markup, parseOnlyThese, fromEncoding, markupMassage, smartQuotesTo, convertEntities, selfClosingTags, isHTML, builder)
   1228         self.markupMassage = markupMassage
   1229         try:
-> 1230             self._feed(isHTML=isHTML)
   1231         except StopParsing:
   1232             pass

/usr/local/lib/python2.6/dist-packages/BeautifulSoup-3.1.0.1-py2.6.egg/BeautifulSoup.pyc in _feed(self, inDocumentEncoding, isHTML)
   1261         self.builder.reset()
   1262 
-> 1263         self.builder.feed(markup)
   1264         # Close out any unfinished strings and close all the open tags.

   1265         self.endData()

/usr/lib/python2.6/HTMLParser.pyc in feed(self, data)
    106         """
    107         self.rawdata = self.rawdata + data
--> 108         self.goahead(0)
    109 
    110     def close(self):

/usr/lib/python2.6/HTMLParser.pyc in goahead(self, end)
    146             if startswith('<', i):
    147                 if starttagopen.match(rawdata, i): # < + letter
--> 148                     k = self.parse_starttag(i)
    149                 elif startswith("</", i):
    150   开发者_Python百科                  k = self.parse_endtag(i)

/usr/lib/python2.6/HTMLParser.pyc in parse_starttag(self, i)
    227     def parse_starttag(self, i):
    228         self.__starttag_text = None
--> 229         endpos = self.check_for_whole_start_tag(i)
    230         if endpos < 0:
    231             return endpos

/usr/lib/python2.6/HTMLParser.pyc in check_for_whole_start_tag(self, i)
    302                 return -1
    303             self.updatepos(i, j)
--> 304             self.error("malformed start tag")
    305         raise AssertionError("we should not get here!")
    306 

/usr/lib/python2.6/HTMLParser.pyc in error(self, message)
    113 
    114     def error(self, message):
--> 115         raise HTMLParseError(message, self.getpos())
    116 
    117     __starttag_text = None

HTMLParseError: malformed start tag, at line 193, column 110

From what I can glean it has something to do with the angle brackets being within quotes, it seems to be thrown off by this. What kind of work around is there, or is there another library that handles these edge cases better? Or alternatively, is there a way to tell it to ignore all javascript content?


the easiest way would probably be to delete all the scripts. see the section Removing Elements in the documentation: http://www.crummy.com/software/BeautifulSoup/documentation.html#Removing%20elements

0

精彩评论

暂无评论...
验证码 换一张
取 消