Python beginner: read elements in one file and use them to modify another file

Developer https://www.devze.com 2023-03-07 02:04 Source: Internet

I'm an economist with no programming background. I'm trying to learn how to use python because I've been told that it is very powerful for parsing data from websites. At the moment, I'm stuck with the following code and I would be extremely grateful for any suggestion.

First of all, I wrote some code to parse the data from this table:

http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146

The code I wrote is the following:

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)



outfile = open("milano.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()

The code reads the table, takes only the information that I need, and creates a txt file. It is pretty rudimentary, but it gets the job done.

My problem starts now. The URL that I posted above is just one of approximately 200 from which I need to parse data. The URLs differ in only two elements. Using the previous URL:

http://www.webifel.it/sifl/Tavola07.asp?comune=MILANO&cod_istat=15146

the two elements that uniquely identify this page are MILANO (the name of the city) and 15146 (a bureaucratic code).

What I wanted to do was, first, to create a file with two columns:

  1. In the first the names of the cities I need;
  2. In the second the bureaucratic codes.

Then, I wanted to create a loop in Python that reads each line of this file, modifies the URL in my code accordingly, and performs the parsing task separately for each city.

Do you have any suggestions about how to proceed? Thanks in advance for any help!

[Update]

Thanks to all for the helpful suggestions. I found Thomas K's answer the easiest to implement given my knowledge of Python. I still have problems, though. I modified the code in the following way:

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv

def extract(soup):
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)

citylist = csv.reader(open("citycodes.csv", "rU"), dialect = csv.excel)
for city in citylist:
    outfile = open("%s.txt", "w") % city
    br = Browser()
    br.set_handle_robots(False)
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % city
    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1)
    outfile.close()

where citycodes.csv is in the following format

MILANO;12345
MODENA;67891

I get the following error:

Traceback (most recent call last):
  File "modena2.py", line 25, in <module>
    outfile = open("%s.txt", "w") % city
TypeError: unsupported operand type(s) for %: 'file' and 'list'

Thanks again!


One little thing you need to fix:

This:

for city in citylist:
    outfile = open("%s.txt", "w") % city
#                                 ^^^^^^

Should be this:

for city in citylist:
    outfile = open("%s.txt" % city, "w")
#                           ^^^^^^
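
The reason is the order of evaluation: in the first form, open(...) runs first and % is then applied to the file object it returns. A related point is that csv.reader yields each row as a list, while % only unpacks tuples, so it is probably safer to pick the fields explicitly. A minimal sketch, written for Python 3 and using a made-up row rather than a real file:

```python
# A row as csv.reader would yield it: a list of strings (illustrative values).
city = ["MILANO", "15146"]

# Format the string first, then pass the finished name to open().
filename = "%s.txt" % city[0]

# The % operator only unpacks tuples, so convert the list before
# filling a template that has two placeholders.
url_template = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"
url = url_template % tuple(city)

print(filename)
print(url)
```

With the list passed directly, "%s.txt" % city would put the whole list representation into the file name, and the two-placeholder template would fail for lack of arguments.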


If the file is in CSV format then you can use csv to read it. Then just use urllib.urlencode() to generate the query string, and urlparse.urlunparse() to generate the full URL.
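
As a rough sketch of that approach: in Python 3 both functions live in urllib.parse (in Python 2 they are urllib.urlencode and urlparse.urlunparse). The helper name build_url is hypothetical, and the scheme, host and path are simply taken from the URL in the question.

```python
from urllib.parse import urlencode, urlunparse


def build_url(city, code):
    """Build the table URL for one city (build_url is a hypothetical helper)."""
    # A list of pairs keeps the parameters in a fixed order.
    query = urlencode([("comune", city), ("cod_istat", code)])
    # urlunparse takes (scheme, netloc, path, params, query, fragment).
    return urlunparse(
        ("http", "www.webifel.it", "/sifl/Tavola07.asp", "", query, "")
    )


print(build_url("MILANO", "15146"))
```

The main benefit over plain string formatting is that urlencode escapes characters such as spaces in city names.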


There is no need to create a separate file; use a Python dictionary instead, mapping each city to its code (city -> code).

See: http://docs.python.org/tutorial/datastructures.html#dictionaries
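
A minimal sketch of the dictionary idea, in Python 3. Only the MILANO entry comes from the question; the remaining cities would have to be typed in by hand, and the names city_codes and urls are just illustrative.

```python
# City name -> code. Only MILANO is taken from the question.
city_codes = {
    "MILANO": "15146",
}

url_template = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"

# Build one URL per city by looping over the (name, code) pairs.
urls = {}
for city, code in city_codes.items():
    urls[city] = url_template % (city, code)

print(urls["MILANO"])
```

With around 200 cities, keeping the pairs in a file may still be more convenient than a literal in the script.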


Quick and dirty:

import csv
citylist = csv.reader(open("citylist.csv"))
for city in citylist:
    # csv.reader yields each row as a list; % needs a tuple to fill both placeholders
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s" % tuple(city)
    # open the page and extract the information

Assuming you have a csv file looking like:

MILANO,15146
ROMA,12345

There are more powerful tools, like urllib.urlencode() as Ignacio mentioned. But they're probably overkill for this.

P.S. Congratulations: you've done the hard bit - scraping data from HTML. Looping over a list is the easy bit.
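
If the file uses semicolons, as in the sample from the question's update, csv.reader needs to be told the delimiter, otherwise each row comes back as a single field. A small Python 3 sketch, with an in-memory string standing in for the real file and illustrative codes:

```python
import csv
import io

# Stand-in for open("citycodes.csv"); the codes are illustrative.
sample = io.StringIO("MILANO;15146\nMODENA;67891\n")

url_template = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%s"

jobs = []
for row in csv.reader(sample, delimiter=";"):
    if not row:
        continue  # skip blank lines
    city, code = row  # each row is a two-item list
    jobs.append((city + ".txt", url_template % (city, code)))

for filename, url in jobs:
    print(filename, url)
```

Each entry in jobs pairs an output file name with the URL to fetch, which is all the scraping loop needs.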


Just scratching out the basics...

#!/usr/bin/env python

from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

outfile = open("milano.txt", "w")

def extract(soup):
    global outfile
    table = soup.find("table", cellspacing=2)
    for row in table.findAll('tr')[2:]:
        col = row.findAll('td')
        year = col[0].div.b.font.string
        detrazione = col[1].div.b.font.string
        ordinaria = col[2].div.b.font.string
        principale = col[3].div.b.font.string
        scopo = col[4].div.b.font.string
        record = (year, detrazione, ordinaria, principale, scopo)
        print >> outfile, "|".join(record)



br = Browser()
br.set_handle_robots(False)

# fill in your (city, code) pairs here, for example:
ListOfCityCodePairs = [('MILANO', 15146)]

for (city, code) in ListOfCityCodePairs:
    # %s takes the city name, %d the integer code
    url = "http://www.webifel.it/sifl/Tavola07.asp?comune=%s&cod_istat=%d" % (city, code)
    page1 = br.open(url)
    html1 = page1.read()
    soup1 = BeautifulSoup(html1)
    extract(soup1)

outfile.close()
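
Note that this version sends every city to milano.txt through a global. One possible variation, sketched here in Python 3 with hypothetical names, is to pass the output file into the writing function so that each city can get its own file. The row below is a placeholder, not real data, and io.StringIO stands in for a file opened per city.

```python
import io


def write_records(records, outfile):
    """Write each record as one pipe-delimited line to the given file object."""
    for record in records:
        outfile.write("|".join(record) + "\n")


# Placeholder row standing in for what extract() pulls from the table.
rows = [("year1", "detrazione1", "ordinaria1", "principale1", "scopo1")]

# In the real loop this would be open("%s.txt" % city, "w").
buffer = io.StringIO()
write_records(rows, buffer)
print(buffer.getvalue())
```

Because the function no longer depends on a global, the loop over cities can open and close a separate file on each pass.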