I'd like to know how to fix broken html tags before parsing it with Beautiful Soup.
In the following script the td>
needs to be replaced with <td
.
How can I do the substitution so Beautiful Soup can see it?
from BeautifulSoup import BeautifulSoup
s = """
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>"""
a = BeautifulSoup(s)
left = []
right = []
for tr in a.findAll('tr'):
l, r = tr.findAll('td')
left.extend(l.findAll(text=True))
ri开发者_运维技巧ght.extend(r.findAll(text=True))
print left + right
Edit (working):
I grabbed a complete (at least it should be complete) list of all html tags from w3 to match against. Try it out:
fixedString = re.sub(">\s*(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>>", s)
bs = BeautifulSoup(fixedString)
Produces:
>>> print s
<tr>
td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
>>> print re.sub(">\s*(\!--|\!DOCTYPE|\
a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>>", s)
<tr><td>LABEL1</td><td>INPUT1</td>
</tr>
<tr>
<td>LABEL2</td><td>INPUT2</td>
</tr>
This one should match broken ending tags as well (</endtag>
):
re.sub(">\s*(/?)(\!--|\!DOCTYPE|\a|abbr|acronym|address|applet|area|\
b|base|basefont|bdo|big|blockquote|body|br|button|\
caption|center|cite|code|col|colgroup|\
dd|del|dfn|dir|div|dl|dt|\
em|\
fieldset|font|form|frame|frameset|\
head|h1|h2|h3|h4|h5|h6|hr|html|\
i|iframe|img|input|ins|\
kbd|\
label|legend|li|link|\
map|menu|meta|\
noframes|noscript|\
object|ol|optgroup|option|\
p|param|pre|\
q|\
s|samp|script|select|small|span|strike|strong|style|sub|sup|\
table|tbody|td|textarea|tfoot|th|thead|title|tr|tt|\
u|ul|\
var)>", "><\g<1>\g<2>>", s)
If that's the only thing you're concerned about td> -> , try:
myString = re.sub('td>', '<td>', myString)
Before sending myString to BeautifulSoup. If there are other broken tags give us some examples and we'll work on it : )
精彩评论