Need more efficient way to parse out csv file in Python_问答_开发者

Here's a sample csv file

id, serial_no
2, 500
2, 501
2, 502
3, 600
3, 601

This is the output I'm looking for (list of serial_no withing a list of ids):

[2, [500,501,502]]
[3, [600, 601]]

I have implemented my solution but it's too much code and I'm sure there are better solutions out there. Still learning Python and I don't know all the tricks yet.

file = 'test.csv'

data = csv.reader(open(file))
fields = data.next()

for row in data:
  each_row = []     
    each_row.append(row[0])
    each_row.append(row[1])
    zipped_data.append(each_row)
for rec in zipped_data:
  if rec[0] not in ids:
    ids.append(rec[0])
for id in ids:
    for rec in zipped_data:
      if rec[0] == id:
        ser_no.append(rec[1])
  tmp.append(id)
  tmp.append(ser_no)
  print tmp
  tmp = []
  ser_no = []

**I've omitted var initializing for simplicity of code

print tmp

Gives me output I mentioned above. I know there's a better way to do this or pythonic way to do it. It's just too messy! Any suggestions would be great!

from collections import defaultdict

records = defaultdict(list)

file = 'test.csv'

data = csv.reader(open(file))
fields = data.next()

for row in data:
    records[row[0]].append(row[1])

#sorting by ids since keys don't maintain order
results = sorted(records.items(), key=lambda x: x[0])
print results

If the list of serial_nos need to be unique just replace defaultdict(list) with defaultdict(set) and records[row[0]].append(row[1]) with records[row[0]].add(row[1])

Instead of a list, I'd make it a collections.defaultdict(list), and then just call the append() method on the value.

result = collections.defaultdict(list)
for row in data:
  result[row[0]].append(row[1])

Here's a version I wrote, looks like there are plenty of answers for this one already though.

You might like using csv.DictReader, gives you easy access to each column by field name (from the header / first line).

#!/usr/bin/python
import csv

myFile = open('sample.csv','rb')
csvFile = csv.DictReader(myFile)
# first row will be used for field names (by default)

myData = {}

for myRow in csvFile:
    myId = myRow['id']
    if not myData.has_key(myId): myData[myId] = []
    myData[myId].append(myRow['serial_no'])

for myId in sorted(myData):
    print '%s %s' % (myId, myData[myId])

myFile.close()

Some observations:

0) file is a built-in (a synonym for open), so it's a poor choice of name for a variable. Further, the variable actually holds a file name, so...

1) The file can be closed as soon as we're done reading from it. The easiest way to accomplish that is with a with block.

2) The first loop appears to go over all the rows, grab the first two elements from each, and make a list with those results. However, your rows already all contain only two elements, so this has no net effect. The CSV reader is already an iterator over rows, and the simple way to create a list from an iterator is to pass it to the list constructor.

3) You proceed to make a list of unique ID values, by manually checking. A list of unique things is better known as a set, and the Python set automatically ensures uniqueness.

4) You have the name zipped_data for your data. This is telling: applying zip to the list of rows would produce a list of columns - and the IDs are simply the first column, transformed into a set.

5) We can use a list comprehension to build the list of serial numbers for a given ID. Don't tell Python how to make a list; tell it what you want in it.

6) Printing the results as we get them is kind of messy and inflexible; better to create the entire chunk of data (then we have code that creates that data, so we can do something else with it other than just printing it and forgetting it).

Applying these ideas, we get:

filename = 'test.csv'

with open(filename) as in_file:
    data = csv.reader(in_file)
    data.next() # ignore the field labels
    rows = list(data) # read the rest of the rows from the iterator

print [
    # We want a list of all serial numbers from rows with a matching ID...
    [serial_no for row_id, serial_no in rows if row_id == id]
    # for each of the IDs that there is to match, which come from making
    # a set from the first column of the data.
    for id in set(zip(*rows)[0])
]

We can probably do even better than this by using the groupby function from the itertools module.

example using itertools.groupby. This only works if the rows are already grouped by id

from csv import DictReader
from itertools import groupby
from operator import itemgetter

filename = 'test.csv'

# the context manager ensures that infile is closed when it goes out of scope
with open(filename) as infile:

    # group by id - this requires that the rows are already grouped by id
    groups = groupby(DictReader(infile), key=itemgetter('id'))

    # loop through the groups printing a list for each one
    for i,j in groups:
        print [i, map(itemgetter(' serial_no'), list(j))]

note the space in front of ' serial_no'. This is because of the space after the comma in the input file