I'm new to Python and have what is probably a very basic question about the 'be开发者_Go百科st' way to store data in my code. Any advice much appreciated!
I have a long .csv file in the following format:
Scenario,Year,Month,Value
1,1961,1,0.5
1,1961,2,0.7
1,1961,3,0.2
etc.
My scenario values run from 1 to 100, year goes from 1961 to 1990 and month goes from 1 to 12. My file therefore has 100*29*12 = 34800 rows, each with an associated value.
I'd like to read this file into some kind of Python data structure so that I can access a 'Value' by specifying the 'Scenario', 'Year' and 'Month'. What's the best way to do this please (or what are the various options)?
In my head I think of this data as a kind of 'number cuboid' with axes for Scenario, Year and Month, so that each Value is located at co-ordinates (Scenario, Year, Month). For this reason, I'm tempted to try to read these values into a 3D numpy array and use Scenario, Year and Month as indices. Is this a sensible thing to do?
I guess I could also make a dictionary where the keys are something like
str(Scenario)+str(Year)+str(Month)
Would this be better? Are there other options?
(By 'better' I suppose I mean 'faster to access', although if one method is much less memory intensive than another it'd be good to know about that too).
Thanks very much!
I'd use a dict of tuples. Simple, fast, and a hash-table look-up to retrieve a single value:
import csv
reader = csv.reader(open('data.csv', 'rb'))
header = reader.next()
data = {}
for row in reader:
key = tuple([int(v) for v in row[:-1]])
val = row[-1]
data[key] = float(val)
# Retrieve a value
print data[1, 1961, 3]
I would use sqlite3 for storing the data to disk. You'll be able to read in the full data set or subsets through SQL queries. You can then load that data into a numpy array or other Python data structure -- whatever is most convenient for the task.
If you do choose to use sqlite, also note that sqlite has a TIMESTAMP data type.
It may be a good idea to combine the year and month into one TIMESTAMP. When you read TIMESTAMPs into Python, sqlite3
can be told to automatically convert the TIMESTAMPs into datetime.datetime
objects, which would reduce some of the boilerplate code you'd otherwise have to write. It will also make it easier to form SQL queries which ask for all the rows between two dates.
sqlite is a nice option if you're going to access your values by different parameters each time.
If that's not the case, and you'll always access by this triplet (scenario, year, month), you can use a Tuple (immutable list) as your key, and the value as your value.
In code it would look like:
d = {}
d[1, 1961, 12] = 0.5
or in more generic loop code:
d[scenario, year, month] = value
later on you can just access it with:
print d[scenario, year, month]
Python will automatically create the Tuple for you.
Make a dictionary of dictionaries of dictionaries like you described. If you need data as numbers, convert them to numbers once when your read them and stores numbers in the dicts. It will be faster then using strings as keys. Let me know if need a help with the code.
精彩评论