Scraping text from multiple HTML files into a single CSV file

Developer https://www.devze.com 2023-02-03 10:15 Source: Internet
I have just over 1500 HTML pages (1.html to 1500.html). I have written code using Beautiful Soup that extracts most of the data I need but "misses" some of the data within the table.

My Input: e.g. file 1500.html

My Code:

#!/usr/bin/env python
# Python 2 script using BeautifulSoup 3
import glob
import codecs
from BeautifulSoup import BeautifulSoup

with codecs.open('dump2.csv', "w", encoding="utf-8") as csvfile:
    for file in glob.glob('*html*'):
        print 'Processing', file
        soup = BeautifulSoup(open(file).read())
        rows = soup.findAll('tr')
        for tr in rows:
            cols = tr.findAll('td')
            #print >> csvfile,"#".join(col.string for col in cols)
            #print >> csvfile,"#".join(td.find(text=True))
            # write one cell per line
            for col in cols:
                print >> csvfile, col.string
            # "===" marks the end of a table row
            print >> csvfile, "==="
        # "***" marks the end of a file
        print >> csvfile, "***"

Output:

One CSV file, with 1500 lines of text and columns of data. For some reason my code does not pull out all the required data but "misses" some of it, e.g. the Address1 and Address 2 data at the start of the table do not come out. I modified the code to put in *** and === separators, and I then use Perl to turn the output into a clean CSV file. Unfortunately, I'm not sure how to change my code to get all the data I'm looking for!


Find the files where values are missing, and then try to analyse what happened in them.

I think that some files have a different format, or maybe the address field really is missing.
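One way to follow this suggestion is sketched below. It rests on an assumption the thread does not confirm: that the missing address cells contain nested markup such as `<td><b>Address1</b><br>Address 2</td>`. In BeautifulSoup 3, `col.string` is None unless the cell holds exactly one text node, so such cells would print as "None". The sketch uses Python 3 and the standard library's `html.parser` instead of BeautifulSoup; the names `CellCollector`, `parse_rows` and `dump_to_csv` are made up for illustration.

```python
import csv
import glob
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, row by row, including text inside nested tags."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows, each a list of cell strings
        self.nested_cells = 0    # cells that contained child tags such as <b> or <br>
        self._row = None         # cells of the row currently being read
        self._cell = None        # text fragments of the cell currently being read
        self._has_child = False  # True once a tag is seen inside the current cell

    def _close_cell(self):
        # Join the fragments of the open cell and normalise whitespace.
        if self._cell is None:
            return
        text = " ".join(" ".join(self._cell).split())
        if self._row is not None:
            self._row.append(text)
        if self._has_child:
            self.nested_cells += 1
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()    # tolerate a missing </tr>
            self._row = []
        elif tag == "td":
            self._close_cell()   # tolerate a missing </td>
            self._cell = []
            self._has_child = False
        elif self._cell is not None:
            self._has_child = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            self._close_cell()
        elif tag == "tr":
            self._close_row()


def parse_rows(html):
    """Return (rows, number of cells that contained nested tags)."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows, parser.nested_cells


def dump_to_csv(pattern, out_path):
    """Write one CSV line per table row and report files with nested cells."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for path in sorted(glob.glob(pattern)):
            with open(path, encoding="utf-8", errors="replace") as handle:
                rows, nested = parse_rows(handle.read())
            if nested:
                print(path, "has", nested, "cell(s) with nested markup")
            for row in rows:
                writer.writerow([path] + row)
```

Calling `dump_to_csv("*.html", "dump2.csv")` in the directory with the pages would list the files worth inspecting by hand and write every row to the CSV file directly, which may make the separator lines and the Perl clean-up step unnecessary. The file encoding (UTF-8 here) is a guess and may need adjusting.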
