How do I make BeautifulSoup parse the contents of textarea tags as HTML?

Developer https://www.devze.com 2022-12-27 14:05 Source: Internet

Before 3.0.5, BeautifulSoup used to treat the contents of <textarea> as HTML. It now treats it as text. The document I am parsing has HTML inside the textarea tags, and I am trying to process it.

I've tried:

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents)
        textarea.replaceWith(contents.html(text=True))

But I'm getting errors. I can't find this in the documentation, and the alternative parsers aren't helping. Anyone know how I can parse the textareas as HTML?

Edit:

Sample HTML is:

    <textarea class="ks-lazyload-custom">
      <div class="product-view product-view-rug">
        Foobar Womble
        <div class="product-view-head">
          <img src="tps/i1/fo-25.gif" />
        </div>
      </div>
    </textarea>

Error is:

    File "D:\src\cross\tserver\src\tools\sitecrawl\BeautifulSoup.py", line 1913, in _detectEncoding
      '^<\?.*encoding=[\'"](.*?)[\'"].*\?>').match(xml_data)
    TypeError: expected string or buffer

I'm looking for a way of taking an element, extracting its contents, parsing them with BeautifulSoup, collapsing the result to text, and then replacing the contents of the original element (or the whole element) with that text.

As for real world vs. specs, that isn't particularly relevant here. The data needs to be parsed, and I'm looking for a way to do so.


This seems to work fairly well (if I correctly understood what you wanted):

    for textarea in soup.findAll('textarea'):
        contents = BeautifulSoup.BeautifulSoup(textarea.contents[0]).renderContents()
        textarea.replaceWith(contents)
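
The TypeError in the question is most likely caused by passing `textarea.contents`, which is a list of child nodes, to the constructor, which expects a string. The traceback shows the failure inside a regular-expression `match` call. The sketch below reproduces that situation with only the standard library, under Python 3, where the message wording differs from the Python 2 "expected string or buffer". The `contents` list and its markup are hypothetical stand-ins, not output from BeautifulSoup.

```python
import re

# The same pattern that appears in the traceback from the question.
ENCODING_DECL = re.compile(r'^<\?.*encoding=[\'"](.*?)[\'"].*\?>')

# Hypothetical stand-in for textarea.contents: a list holding one string.
contents = ['<?xml version="1.0" encoding="utf-8"?><div>Foobar Womble</div>']

try:
    # Passing the whole list fails, because match() needs a string.
    ENCODING_DECL.match(contents)
    list_error = None
except TypeError as exc:
    list_error = exc

# Indexing the first child hands the regex an actual string.
match = ENCODING_DECL.match(contents[0])
detected = match.group(1) if match else None
```

This is why indexing with `contents[0]` avoids the error. Note that it only picks up the first child, so a textarea whose content was split into several nodes would need them joined first.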


I'm now using the following code, which mostly works. Your mileage may vary.

    def _extractText(self, data, encoding):
        if self.isDebug:
            self._output("_extractText")
        soup = BeautifulSoup.BeautifulSoup(data, fromEncoding=encoding)
        # Remove comments, scripts and stylesheets so only visible text remains.
        comments = soup.findAll(text=lambda text: isinstance(text, BeautifulSoup.Comment))
        for comment in comments:
            comment.extract()
        for script in soup.findAll('script'):
            script.extract()
        for css in soup.findAll('style'):
            css.extract()
        # Re-parse the markup inside each textarea and store the extracted text.
        for textarea in soup.findAll('textarea'):
            textarea.string = self._extractText(textarea.renderContents(), 'UTF-8')
        text = unicode('')
        for line in soup.findAll(text=True):
            line = line.replace('&nbsp;', ' ').strip()
            if line == '':
                continue
            # Skip leftover doctype declarations.
            if line.startswith('doctype') or line.startswith('DOCTYPE'):
                continue
            text = text + line + '\n'
        return text
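
For readers on Python 3 without BeautifulSoup 3, the same "collapse markup to text" step can be roughly sketched with the standard library's `html.parser`. This swaps in a different parser, so it is not the code above. It assumes the inner markup of the textarea is already available as a string. The names `TextCollector` and `extract_text` are made up for illustration.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect visible text, skipping the bodies of <script> and <style>."""

    SKIPPED = ("script", "style")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        # &nbsp; arrives as U+00A0 because character references are converted.
        line = data.replace("\xa0", " ").strip()
        if line:
            self.lines.append(line)


def extract_text(fragment):
    """Return the text of an HTML fragment, one non-empty chunk per line."""
    parser = TextCollector()
    parser.feed(fragment)
    parser.close()  # flush any text still buffered by the parser
    return "".join(line + "\n" for line in parser.lines)
```

Comments and doctype declarations are dropped because the base class ignores them unless the corresponding handlers are overridden, so no explicit doctype check is needed here.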