I am working with a large collection of documents prepared by more than 5,000 different entities. One of the things I am trying to do is determine whether a box has been checked. Each preparer has to indicate some information by checking one of five boxes.
The problem is that each preparer decided on their own how to represent a check box in the HTML. Some of their representations are interesting. They mostly rely on Wingdings as the font directive. Here are a few of the types of checked boxes I have found so far:
'serif">S</font>'
'wingdings">x</font>'
'ü'
'ý'
'þ'
<font style="font-family: Wingdings; font-variant: normal">þ</font>
The piece of code pasted above displays a checked box when the document is opened with a variant of IE, but it renders something else when the document is opened with Firefox, Safari or Chrome.
Here is another example:
<div style="DISPLAY: block; MARGIN-LEFT: 0pt; TEXT-INDENT: 0pt; MARGIN-RIGHT: 0pt" align="center"><font style="DISPLAY: inline; FONT-SIZE: 10pt; FONT-FAMILY: times new roman">THE DATA THAT HAS THE CHECKED BOX <font style="DISPLAY: inline; FONT-FAMILY: wingdings 2, serif">R</font></font></div>
So I guess in its simplest form my question is:
Is there something in Python that 'knows' that
<font style="DISPLAY: inline; FONT-FAMILY: wingdings 2, serif">R</font>
is a checked box? And, extending that further, is there something that 'knows' this for just about every way a checked box can be presented in HTML?
I want to note that when I check the text of that font element I get a Unicode R.
I hope this is clearer.
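To make the problem concrete, here is a rough sketch of the kind of check I have in mind. It is not a library that 'knows' every convention. It only pairs the text of each font element with its font family and compares the pair against a table. The table CHECKED_GLYPHS and the names first_font_family, CheckedBoxFinder and find_checked_boxes are made up for illustration, and the table is seeded only from the examples in this post (I am assuming the 'ü', 'ý' and 'þ' cases were set in Wingdings), so it would have to grow as new variants turn up.

```python
import re
from html.parser import HTMLParser

# Hypothetical table: font family -> characters treated as a checked box.
# Seeded from the examples in this post only; assumed, not exhaustive.
CHECKED_GLYPHS = {
    "wingdings": {"x", "\u00fc", "\u00fd", "\u00fe"},  # x, ü, ý, þ
    "wingdings 2": {"R"},
}


def first_font_family(style):
    """Return the first family named in an inline style, lowercased, or None."""
    match = re.search(r"font-family\s*:\s*([^;]+)", style or "", re.IGNORECASE)
    if not match:
        return None
    first = match.group(1).split(",")[0]
    return first.strip().strip("'\"").lower() or None


class CheckedBoxFinder(HTMLParser):
    """Collect (font family, text) pairs that appear in CHECKED_GLYPHS."""

    def __init__(self):
        super().__init__()
        self.font_stack = []  # effective family for each open <font>
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag != "font":
            return
        attributes = dict(attrs)
        family = first_font_family(attributes.get("style"))
        if family is None:
            # Fall back to the legacy face attribute, then to the parent font.
            face = (attributes.get("face") or "").split(",")[0].strip().lower()
            family = face or (self.font_stack[-1] if self.font_stack else None)
        self.font_stack.append(family)

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.font_stack:
            return
        family = self.font_stack[-1]
        if text in CHECKED_GLYPHS.get(family, ()):
            self.hits.append((family, text))


def find_checked_boxes(html):
    """Return the list of (font family, text) pairs that look like checked boxes."""
    finder = CheckedBoxFinder()
    finder.feed(html)
    finder.close()
    return finder.hits
```

Calling find_checked_boxes on the div example above should report the inner font element, since its family is 'wingdings 2' and its text is 'R', while the surrounding Times New Roman text is ignored. The obvious weakness is that the table only covers conventions someone has already seen.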
The way I see it, it appears like this.
The ASCII value of 'S' is 83. If you look up 83 in Wingdings, you get "droplet". The Unicode equivalent of "droplet" is