BeautifulSoup -- Prevent Tag From Automatically Closing

Developer https://www.devze.com 2023-03-13 03:10 Source: web

BeautifulSoup is choking on parsing the following code:

>>> soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />')
>>> soup.prettify()
'<img src="#" alt="Click Here &gt;" />\n" border="0" />\n'

I should also note that I have no control over the input HTML. There are many different variations of the text and attributes, so I want to avoid using regex.

Does anyone have a suggestion for stopping BeautifulSoup from automatically closing the img tag when it runs into the ">" symbol?

Edit 1: I have found this in the documentation. Could I control how BeautifulSoup parses the IMG tag?

Edit 2: I solved my problem. Before calling BeautifulSoup, I did a text replace:

text = text.replace('>"', '&gt;"')
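For anyone copying the workaround above, here is a minimal sketch of it as a small function, using only built-in string methods. The name `escape_gt_before_quote` is made up for illustration. Note that `str.replace` returns a new string rather than modifying `text` in place, and that the replacement is blind to context: it only catches a `>` that sits directly before a double quote, and it will also rewrite a legitimate tag end that happens to be followed by a quote.

```python
def escape_gt_before_quote(text: str) -> str:
    """Escape every '>' that is immediately followed by a double quote."""
    # str.replace returns a new string, so the result must be kept.
    return text.replace('>"', '&gt;"')


raw = '<img src="#" alt="Click Here >" border="0" />'
fixed = escape_gt_before_quote(raw)
print(fixed)
# <img src="#" alt="Click Here &gt;" border="0" />

# Caveat: a real tag end followed by quoted text is rewritten as well.
damaged = escape_gt_before_quote('<b>"hi"</b>')
print(damaged)
# <b&gt;"hi"</b>
```

So this is only safe if you know the input never has a quote character right after a closing `>`.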

BeautifulSoup4 has since been updated to be context-aware, which fixes this issue. If you update to the latest version of BeautifulSoup4, it will ignore the > character when it is enclosed in quotes.

from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />', 'html.parser')
print(soup.img.attrs)
# {'src': '#', 'alt': 'Click Here >', 'border': '0'}
print(soup.prettify())
# A single <img> tag that keeps all three attributes (exact formatting depends on the version)

The example shows that the alt attribute correctly contains the > character and that the border attribute has been recognised.
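If you want to see the quote-aware behaviour without installing anything, here is a rough sketch using Python 3's standard library `html.parser`, which is the parser BS4 uses when you pass `'html.parser'`. This is not BeautifulSoup's own code, and `AttrCollector` is just a hypothetical helper name for this illustration.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record (tag, attributes) for every start tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.tags.append((tag, dict(attrs)))


collector = AttrCollector()
collector.feed('<img src="#" alt="Click Here >" border="0" />')
collector.close()
print(collector.tags)
# [('img', {'src': '#', 'alt': 'Click Here >', 'border': '0'})]
```

The ">" inside the quoted alt value does not end the tag, so border is still picked up as an attribute of img.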