How to tell BeautifulSoup to extract the content of a specific tag as text? (without touching it)

Developer https://www.devze.com 2023-02-08 20:14 Source: Internet
I need to parse an HTML document which contains "code" tags.

I'm getting the code blocks like this:

soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')

The problem is, if I have a code tag like this:

<code class="csharp">
    List<Person> persons = new List<Person>();
</code>

BeautifulSoup forces the closing of what it sees as nested tags and transforms the code block into:

<code class="csharp">
    List<person> persons = new List</person><person>();
    </person>
</code>

Is there any way to extract the content of the code tags as text with BeautifulSoup, without letting it fix what it thinks are HTML markup errors?


Add the code tag to the QUOTE_TAGS dictionary, so that BeautifulSoup treats its contents as literal text instead of markup.

from BeautifulSoup import BeautifulSoup

content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"

# Register <code> as a quote tag before parsing, so its contents are left alone
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
print(code_blocks)

Output:

[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
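As a parser-free alternative (not part of the answer above), the raw text between the code tags can also be pulled out with a regular expression, so nothing ever tries to repair it. This is only a minimal sketch: extract_code_blocks and CODE_RE are hypothetical names, and it assumes code tags are not nested and that no attribute value contains a ">" character.

```python
import re

# Hypothetical pattern: matches <code ...>...</code> and captures the inner text.
# Assumes no nested <code> tags and no ">" inside attribute values.
CODE_RE = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html_text):
    """Return the untouched inner text of every <code> element."""
    return CODE_RE.findall(html_text)


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
blocks = extract_code_blocks(content)
print(blocks)
```

Because no HTML parser is involved, List<Person> comes back exactly as written. The trade-off is that the rest of the document still has to be parsed separately if you need anything other than the code blocks.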