I need to parse an HTML document that contains "code" tags.
I'm getting the code blocks like this:
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
The problem is, if I have a code tag like this:
<code class="csharp">
List<Person> persons = new List<Person>();
</code>
BeautifulSoup forces the closing of nested tags and transforms the code block into:
<code class="csharp">
List<person> persons = new List</person><person>();
</person>
</code>
Is there any way to extract the content of the code tags as text with BeautifulSoup, without letting it fix what it thinks are HTML markup errors?
Add the code tag to the QUOTE_TAGS dictionary, so that BeautifulSoup treats its contents as literal text instead of parsing them as markup:
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 import path
content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
# Register 'code' as a quote tag before building the soup
BeautifulSoup.QUOTE_TAGS['code'] = None
soup = BeautifulSoup(str(content))
code_blocks = soup.findAll('code')
Output:
[<code class="csharp"> List<Person> persons = new List<Person>(); </code>]
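If changing QUOTE_TAGS is not an option, one possible workaround is to pull the raw text out of the code tags before any HTML parser touches it. The sketch below is a different technique from the answer above: it uses only the standard-library re module, and extract_code_blocks and CODE_RE are hypothetical names introduced here for illustration. It assumes code tags are not nested inside each other and that no attribute value contains a ">" character.

```python
import re

# Hypothetical helper, not part of BeautifulSoup: match everything between
# an opening <code ...> tag and the next </code>, with no HTML repair applied.
CODE_RE = re.compile(r"<code\b[^>]*>(.*?)</code>", re.DOTALL | re.IGNORECASE)


def extract_code_blocks(html):
    """Return the unmodified contents of every <code> element in html."""
    # findall returns the text of the single capture group for each match;
    # strip() only removes the whitespace surrounding the snippet.
    return [body.strip() for body in CODE_RE.findall(html)]


content = "<code class='csharp'>List<Person> persons = new List<Person>();</code>"
print(extract_code_blocks(content))
# prints ['List<Person> persons = new List<Person>();']
```

Because the regex never interprets the snippet as markup, "List<Person>" comes back exactly as written. The trade-off is that this is plain pattern matching rather than real HTML parsing, so it is only reasonable when the code tags themselves are well formed.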