Storing simulation results in a persistent manner for Python?_问答_开发者

Background: I'm running multiple simulations on a set of data. For each session, I'm allocating projects to students. The difference between each session is that I'm randomising the order of the students such that all the students get a shot at being assigned a project they want. I was writing out some of the allocations in a spreadsheet (i.e. Excel) and it basically looked like this (tiny snapshot, actual table extends to a few thousand sessions, roughly 100 students).

|          | Session 1 | Session 2 | Session 3 |
|----------|-----------|-----------|-----------|
|Stu1      |Proj_AA    |Proj_AB    |Proj_AB    |
|----------|-----------|-----------|-----------|
|Stu2      |Proj_AB    |Proj_AA    |Proj_AC    |
|----------|-----------|-----------|-----------|
|Stu3      |Proj_AC    |Proj_AC    |Proj_AA    |
|----------|-----------|-----------|-----------|

Now, the code that deals with the allocation currently stores a session in an object. The next time the allocation is run, the object is over-written.

Thus what I'd really like to do is to store all the allocation results. This is important since I later need to derive from the data, information such as: which project Stu1 got assigned to the most or perhaps how popular Proj_AC was (how many times it was assigned / number of sessions).

Question(s):

What methods can I possibly use to basically store such session information persistently? Basically, each session output needs to add itself to the repository after ending and before beginning the next allocation cycle.

One solution that was suggested by a friend was mapping these results to a relational database using SQLAlchemy. I kind of like the idea since this does give me an opportunity to delve into databases.

Now the database structure I was recommended was:

|----------|-----------|-----------|
|Session   |Student    |Project    |
|----------|-----------|-----------|
|1         |Stu1       |Proj_AA    |
|----------|-----------|-----------|
|1         |Stu2       |Proj_AB    |
|----------|-----------|-----------|
|1         |Stu3       |Proj_AC    |
|----------|-----------|-----------|
|2         |Stu1       |Proj_AB    |
|----------|-----------|-----------|
|2         |Stu2       |Proj_AA    |
|----------|-----------|-----------|
|2         |Stu3       |Proj_AC    |
|----------|-----------|-----------|
|3         |Stu1       |Proj_AB    |
|----------|-----------|-----------|
|3         |Stu2       |Proj_AC    |
|----------|-----------|-----------|
|3         |Stu3       |Proj_AA    |
|----------|-----------|-----------|

Here, it was suggested that I make the Session and Student columns a composite key. That way I can access a specific record for a particular student for a particular session. Or I can merely get the entire allocation run for a particular session.

Questions:

Is the idea a good one?

How does one implement and query a composite key using SQLAlchemy?

What happens to the database if a particular student is not assigned a project (happens if all projects that he wants are taken)? In the code, if a student is not assigned a project, instead of a proj_id he simply gets None for that field/object.

I apologise for asking multiple questions but since these are closely-related, I thought I'd ask them in the same space.

Edit - 25/03/2010

The class structure for Student currently stands like so:

class Student(DeptPerson):
    def __init__(self, name, stud_id):
        super(Student, self).__init__(name, stud_id)
        self.preferences = collections.defaultdict(set)
        self.allocated_project = None
        self.allocated_rank = 0

(Misc.) It inherits from a class called DeptPerson that just has name and stud_id.

Thus, as the allocation function goes about assigned projects (referenced by a unique ID - Project is its own class) to a student (for a single run/trial/session), it will reflect the value in allocated_project. If a student isn't assigned a project for a particular run (because someone else already ha开发者_开发百科s it, mwhahah...sorry), allocated_project remains as None (this is pretty useful information for me).

A two-dimension display of data is a Relational database table with two keys.

In your example, the Student Key and the Session Key.

A "Composite key" is a piece of noise that you can ignore. It isn't helpful and isn't necessary. A composite key does not solve any problems well and create many difficulties. Pretend you never heard it.

What we do is introduce an additional column with a simple "identifier". It's an "autogenerated, surrogate key" for they row. Autogenerated unique keys for each row are a good thing. Composite keys are a bad thing.

The logical structure you have can be considered a Ternary Relationship, where recommended to you table is corresponding to the Attendance relationship object. Therefore ideally you should also create you object model similar to this:

Storing simulation results in a persistent manner for Python?

_{(source: databasedesignstudio.com)}
.

Now, in your case one could argue why do you need more then 1 table, if you have only one field for each of the Entity tables. But I would still model it this way, as this model better represents the real world, and you still need to store somewhere the Project students prefer to work on, which would be another table with many-to-many relationship to Student table.

Working with entities is better and easier for you to understand sqlalchemy; whereas if you just keep one table, how much will you delve into the database really?

About composite keys: S.Lott gave you good reasons to avoid using them, and I fully agree with his take on the topic.

Can't help you on the db stuff, as I'm a total newb and only know enough to query data from sqlite tables...

For persistence, though, could you use the pickle module to store your objects? Check the docs for the exact usage, but I think it's pretty much file(filename, 'wb').write(pickle.pickle(myobject)) to write it and myobject = pickle.unpickle(file(filename, 'rb')) to read.

Then you can read multiple tables/whatever into multiple variables and do whatever comparisons you want.

If you don't need/want to read it back in via Python, you could also just manually format it as tab delimited or something and load that file into the spreadsheet app of your choice (OpenOffice Calc is fantastic).