BeautifulSoup is choking on parsing the following code:
>>> soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />')
>>> soup.prettify()
'<img src="#" alt="Click Here >" />\n" border="0" />\n'
I should also note that I have no control over the input HTML. There are many different variations of the text/attributes, so I want to avoid using regex.
Anyone have a suggestion for stopping BeautifulSoup from automatically closing the img tag when it runs into the ">" symbol?
Edit 1: I have found this in the documentation. Could I control how BeautifulSoup parses the IMG tag?
Edit 2: I solved my problem. Before I called BS, I did a text replace that escapes a > sitting directly before a closing quote:
text = text.replace('>"', '&gt;"')
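A minimal sketch of the Edit 2 workaround, assuming the intent was to turn a `>` that sits directly before a double quote into the entity `&gt;` before the markup reaches the parser. The helper name `escape_gt_before_quote` is hypothetical and not part of BeautifulSoup.

```python
def escape_gt_before_quote(markup: str) -> str:
    """Escape a '>' that is immediately followed by a double quote.

    Assumption: in the input at hand, the sequence >" only occurs at the
    end of a quoted attribute value, never as a tag end followed by text.
    """
    return markup.replace('>"', '&gt;"')


raw = '<img src="#" alt="Click Here >" border="0" />'
cleaned = escape_gt_before_quote(raw)
print(cleaned)
```

Note that this blind replacement would also rewrite a real tag end followed by quoted text, such as `<p>"hi"</p>`, so it only suits input where that pattern does not occur.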
BeautifulSoup4 has been updated to be context aware and has since solved this issue. If you update to the latest version of BeautifulSoup4, it will treat a > character enclosed in quotes as part of the attribute value rather than as the end of the tag.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<img src="#" alt="Click Here >" border="0" />')
print(soup.img.attrs)
# {'src': '#', 'alt': 'Click Here >', 'border': '0'}
soup.prettify()
# returns a single <img> tag carrying all three attributes; the tag is no longer split at the ">"
The example shows that the alt attribute correctly contains the > character, and that the border attribute has been recognised.
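To illustrate what quote-aware tokenizing looks like without BeautifulSoup itself, here is a small sketch using Python's standard-library `html.parser`, which BeautifulSoup 4 can use as a backend. The class `AttrCollector` and the function `collect_tags` are hypothetical names for this illustration only.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record every start tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


def collect_tags(markup):
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


tags = collect_tags('<img src="#" alt="Click Here >" border="0" />')
print(tags)
```

If the > inside the quoted alt value were treated as the end of the tag, the border attribute would be missing from the collected dictionary.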