开发者

python csv.dictreader not working with a data.gov csv

开发者 https://www.devze.com 2023-03-02 17:54 出处:网络
Using some random CSV data from data.gov, for example:\"Gravesite locations of Veterans and beneficiaries in HAWAII, as of January 2011\" http://www.data.gov/raw/4608 I am trying to parse the CSV with

Using some random CSV data from data.gov, for example: "Gravesite locations of Veterans and beneficiaries in HAWAII, as of January 2011" http://www.data.gov/raw/4608 I am trying to parse the CSV with python and process each row:

randomData = csv.DictReader(open('/downloads/ngl_hawaii.csv', 'rb'), delimiter=",")
     for row in randomData:
         print row

The sample CSV data:

d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war

Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","111444","","SXXXXX","Veteran (Self)","Joe","E","JoJo","","US ARMY","SGT","WORLD WAR II"

The result is not overly pretty (printing one row):

{'v_last_name': None, 'cem_addr_two': None, 'rank': None, 'd_suffix': None, 'city': None, 'row_num': None, 'zip': None, 'cem_phone': None, 'd_last_name': None, e, 'd_first_name': 'Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","11144 "SXXXXX","","US ARMY","SGT","WORLD WAR II"', 'war': None, 'v_mid_name': No开发者_运维百科ne, 'cem_url': None, 'cem_name': None, 'relationship': None, 'v_first_name': None, 'se one, 'cem_addr_one': None, 'd_birth_date': None, 'd_death_date': None}

As you can see, the header fields (first line in the csv) aren't properly being associated to each subsequent row.

Am I doing something wrong, or is the CSV quality poor?

Thanks to Casey for asking if I've opened the file up in another program. Excel messed up the file....


Strange, I get different output from you.

data.csv:

d_first_name,d_mid_name,d_last_name,d_suffix,d_birth_date,d_death_date,section_id,row_num,site_num,cem_name,cem_addr_one,cem_addr_two,city,state,zip,cem_url,cem_phone,relationship,v_first_name,v_mid_name,v_last_name,v_suffix,branch,rank,war "Emil","E","Seibel","","10/02/1920","03/12/2010","139-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","96744","","808-233-3630","Veteran (Self)","Emil","E","Seibel","","US ARMY","SGT","WORLD WAR II",

Script:

for line in csv.DictReader(open('data.csv', 'rb'), delimiter=","):
    print line

Output:

{'v_last_name': 'Seibel', None: [''], 'cem_addr_two': '', 'rank': 'SGT', 'd_suffix': '', 'city': 'KANEOHE', 'row_num': '', 'zip': '96744', 'cem_phone': '808-233-3630', 'd_
last_name': 'Seibel', 'd_mid_name': 'E', 'state': 'HI', 'branch': 'US ARMY', 'd_first_name': 'Emil', 'war': 'WORLD WAR II', 'v_mid_name': 'E', 'cem_url': '', 'cem_name': '
HAWAII STATE VETERANS CEMETERY', 'relationship': 'Veteran (Self)', 'v_first_name': 'Emil', 'section_id': '139-E', 'v_suffix': '', 'site_num': '3', 'cem_addr_one': 'KAMEHAM
EHA HIGHWAY', 'd_birth_date': '10/02/1920', 'd_death_date': '03/12/2010'}

csv.DictReader is supposed to automatically get the field names from the first row in the file is the fieldnames parameter is ommited, as described in the docs.

The None: [''] in the output is caused by the trailing comma on each line of data.

Working code example:

http://codepad.org/HdBhr4La


Looking at the original file, which I downloaded here, it is valid CSV. I mistook the output from your script.

Since your using the csv.DictReader each row is turned into a Dictionary with the header values as keys and the data for each as values. I ran it on the same file and it looks like everything is matched up correctly, although I didn't go through the entire thing.

According to the python docs

class csv.DictReader(csvfile[, fieldnames=None[, restkey=None[, restval=None[, dialect='excel'[, *args, **kwds]]]]])

Create an object which operates like a regular reader but maps the information read into a dict whose keys are given by the optional fieldnames parameter. If the fieldnames parameter is omitted, the values in the first row of the csvfile will be used as the fieldnames. If the row read has more fields than the fieldnames sequence, the remaining data is added as a sequence keyed by the value of restkey. If the row read has fewer fields than the fieldnames sequence, the remaining keys take the value of the optional restval parameter. Any other optional or keyword arguments are passed to the underlying reader instance.

If this isn't the format you want it in, you might want to try the csv.reader which will just return a list for each row and not associate it with the headers.

To use the above DictReader, this is probably what you want:

import csv
reader = csv.DictReader(open('ngl_hawaii.csv', 'rb'), delimiter=','))
for row in reader:
    print row['d_first_name']
    print row['d_last_name']


Just tried this and it works fine with your file (renamed to foo)

import csv

ifile  = open('foo.csv', "rb")
reader = csv.reader(ifile)

rownum = 0
for row in reader:
    # Save header row.
    if rownum == 0:
        header = row
    else:
        colnum = 0
        for col in row:
            print '%-8s: %s' % (header[colnum], col)
            colnum += 1

    rownum += 1

ifile.close()

OUTPUT=

d_first_name: Emil
d_mid_name: E
d_last_name: Seibel
d_suffix: 
d_birth_date: 10/02/1920
d_death_date: 03/12/2010
section_id: 139-E
row_num : 
site_num: 3
cem_name: HAWAII STATE VETERANS CEMETERY
cem_addr_one: KAMEHAMEHA HIGHWAY
cem_addr_two: 
city    : KANEOHE
state   : HI
zip     : 96744
cem_url : 
cem_phone: 808-233-3630
relationship: Veteran (Self)
v_first_name: Emil
v_mid_name: E
v_last_name: Seibel
v_suffix: 
branch  : US ARMY
rank    : SGT
war     : WORLD WAR II


(1) The reported result appears to have been butchered.

Purportedly the result of printing a Python dictionary, it should be capable of being parsed back to a dictionary by Python. Not so; here it is with deletions necessary to get Python to parse it:

d = {'v_last_name': None, 'cem_addr_two': None, 'rank': None,
 'd_suffix': None, 'city': None, 'row_num': None, 'zip': None,
 'cem_phone': None, 'd_last_name': None,
 # e,
 'd_first_name': 'Joe,"E","JoJo","","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","11144 "SXXXXX","","US ARMY","SGT","WORLD WAR II"',
 'war': None, 'v_mid_name': None, 'cem_url': None, 'cem_name': None,
 'relationship': None, 'v_first_name': None,
 # 'se one,
 'cem_addr_one': None, 'd_birth_date': None, 'd_death_date': None}

Compared with the heading line of the actual data file, the following column headings are missing:

'section_id', 'v_suffix', 'd_mid_name', 'state', 'branch', 'site_num'

(2) The reported first data line appears to be an edited version of the actual first data line.

reported: Joe,   "E","JoJo",  "","10/02/1920","03/12/2000","100-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","111444","","SXXXXX",      "Veteran (Self)","Joe", "E","JoJo",  "","US ARMY","SGT","WORLD WAR II" 
actual  : "Emil","E","Seibel","","10/02/1920","03/12/2010","139-E","","3","HAWAII STATE VETERANS CEMETERY","KAMEHAMEHA HIGHWAY","","KANEOHE","HI","96744", "","808-233-3630","Veteran (Self)","Emil","E","Seibel","","US ARMY","SGT","WORLD WAR II",
changed : xxxxxx      xxxxxx                           x     xx                                                                                    xxxxxx      xxxxxxxxxxxx                    xxxx       xxxxxx                                   x

As noted by @Acorn, there is a superflous comma at the end of each actual data line. Also note that the first reported field is not quoted.

(3) The reported data appears NOT to be the result of opening the file with Excel followed by saving as csv.

Excel does only minimal quoting when saving as csv i.e. a field is quoted only if necessary. Experimentation shows no quotes in the first data line. The second data line ends in US MARINE CORPS,GYSGT,"KOREA, VIETNAM" ... quotes being necessitated by the embedded comma.

(4) Further analysis of the reported result

Ignoring the the butchery noted in (1) above, the reported dictionary has some interesting features:

(a) The reported dictionary has most of the expected keys, so it's a reasonable conclusion that csv.DictReader managed to parse the heading line OK.

(b) The key d_first_name (i) is the first in the heading line (ii) has a value that is the unsplit whole of the reported first data line. All of the other keys have value None.

This is consistent with the heading line using the same non-comma delimiter as DictReader, but the first data line using a comma delimiter. Note that the reported code has delimiter="," which is (i) unnecessary -- it's the default -- and (ii) not consistent with the reported result.

Conclusion Blaming Excel appears to be unjustified.

0

精彩评论

暂无评论...
验证码 换一张
取 消