what is the best way to extract data from this HTML file and put it into MySQL database with company phone number, company name and email with a primary key as phone number?
</tr><tr class="tableRowOdd">
<td>"JSC company inc. 00" <email@email.com></td>
<td>1231231234</td>
</tr><tr class="tableRowEven">
<td>"JSC company inc. 01" <email01@email.com></td>
<td>234234234234234</td>
</tr><tr class="tableRowOdd">
<td>"JSC company inc. 02" <email2@email.com></td>
<td>32423234234</td>
</tr><tr class="tableRowEven">
<td>"JSC company inc. 03" <email3@email.com></td>
<td>23423424324</td>
</tr><tr class="tableRowOdd">
<td>"JSC company inc. 04" <email4@email.com></td>
<td>2342342323242开发者_运维百科44</td>
</tr> <tr>
For extracting and general HTML munging look at
http://www.crummy.com/software/BeautifulSoup/
For the MySQL I suggest googling on: MySQL tutorial python
Here is how you get the td
contents into a python list using BeautifulSoup
:
#!/usr/bin/python
from BeautifulSoup import BeautifulSoup, SoupStrainer
def find_rows(data):
table_rows = SoupStrainer('tr')
rows = [tag for tag in BeautifulSoup(data, parseOnlyThese=table_rows)]
return rows
def cell_data(row):
cells = [tag.string for tag in row.contents]
return cells
if __name__ == "__main__":
f = open("testdata.html", "r")
data = f.read()
rows = find_rows(data)
for row in rows:
print cell_data(row)
Save your html file as testdata.html
, and run this script from the same directory.
With the data you posted here, the output is
[u'\n', u'"JSC company inc. 00" <email@email.com>', u'\n', u'1231231234', u'\n']
[u'\n', u'"JSC company inc. 01" <email01@email.com>', u'\n', u'234234234234234', u'\n']
[u'\n', u'"JSC company inc. 02" <email2@email.com>', u'\n', u'32423234234', u'\n']
[u'\n', u'"JSC company inc. 03" <email3@email.com>', u'\n', u'23423424324', u'\n']
[u'\n', u'"JSC company inc. 04" <email4@email.com>', u'\n', u'234234232324244', u'\n']
For the parsing, I definitely also recommend Beautiful Soup.
To put the text in a database, I recommend a good Python ORM. My top suggestion is to use the ORM from Django, if you can. With Django, you not only get an ORM, you also get a web interface that lets you browse through your database with a web browser; you can even enter data into the database using the web browser.
If you can't use Django, I recommend SQLAlchemy.
Good luck.
With lxml you can do it almost as easily as you could do it with jQuery.
from lxml import html
doc = html.parse('test.html').getroot()
for row in doc.cssselect('tr'):
name, phone_number = row.cssselect('td')[:2]
print name.text_content()
print phone_number.text_content()
+1 for BeautifulSoup
Now that you've got the data, you need to put it into MySQL. If you want a pure python solution, you'll also need the MySQL-Python binding.
Otherwise, the SQL you'll need to generate is relatively painless. We'll hijack gnuds example. Add to the top of the file:
import re
Then at the bottom:
exp = r'\"(.*)\" <(.*)>'
for row in rows:
matcher = re.match(exp, row[1])
name, email = matcher.groups()
phone = row[3]
sql = "INSERT INTO company (email, name, phone) VALUES ('%s','%s','%s')" % (email, name, phone)
print sql
Which gives you output like:
INSERT INTO company (email, name, phone) VALUES ('email@email.com','JSC company inc. 00','1231231234');
INSERT INTO company (email, name, phone) VALUES ('email01@email.com','JSC company inc. 01','234234234234234');
INSERT INTO company (email, name, phone) VALUES ('email2@email.com','JSC company inc. 02','32423234234');
INSERT INTO company (email, name, phone) VALUES ('email3@email.com','JSC company inc. 03','23423424324');
INSERT INTO company (email, name, phone) VALUES ('email4@email.com','JSC company inc. 04','234234232324244');
精彩评论