I have about 30 MB of textual data that is core to the algorithms I use in my web application.
On the one hand, the data is part of the algorithm and changes to the data can cause an entire algorithm to fail. This is why I keep the data in text files in my source control, and all changes are auto-tested (pre-commit). I currently have a good level of control. Distributing the data along with the source as we spawn more web instances is a non-issue because it tags along with the source. I currently have these problems:
- I often have to develop special tools to manipulate the files, effectively replicating the functionality of database access tools.
- I would like to give non-developers controlled web-access to this data.
On the other hand, it is data, and it "belongs" in a database. I wish I could place it in a database, but then I would have these problems:
- How do I sync this database to the source? A release contains both code and data.
- How do I ship the data as I spawn a new instance of the web server?
- How do I manage pre-commit testing of the data?
Things I have considered thus far:
- SQLite (does not solve the non-developer access problem)
- Building an elaborate pre-production database that data users edit to create "patches" to the "real" database, which developers then accept, test, and commit. This sounds very complex. I have not fully designed it yet, and I sure hope I'm reinventing the wheel here and some SO user will show me the error of my ways...
BTW: I have a "regular" database as well, with things that are not algorithmic data.
BTW2: I added the Python tag because I currently use Python, Django, Apache, Nginx, and Linux (and some lame developers use Windows). Thanks in advance!
UPDATE
Some examples of the data (the algorithms are natural language processing stuff):
- World Cities and their alternate names
- Names of currencies
- Coordinates of hotels
The list goes on and on, but imagine trying to parse the sentence "Romantic hotel for 2 in Rome arriving in Italy next Monday" if someone changes the coordinates that teach me that Rome is in Italy, or if someone adds "romantic" as an alternate name for Las Vegas (yes, the example is lame, but I hope you get the drift).
Okay, here's an idea:
- Ship all the data as is done now.
- Have the installation script install it in the appropriate databases.
- Let users modify this database, and give them a "restore to original" button that simply reinstalls from the text file (see the sketch after this list).
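A minimal sketch of that "restore to original" step, assuming the shipped data is a tab-separated text file and the target is a plain SQLite table; the file, table, and column names here are made up for illustration:

```python
import csv
import sqlite3
from pathlib import Path

# Hypothetical location of the version-controlled text files shipped with the source.
DATA_DIR = Path(__file__).resolve().parent / "algorithm_data"


def restore_to_original(db_path="resource_data.sqlite"):
    """Discard user edits and reload the table from the shipped text file."""
    conn = sqlite3.connect(db_path)
    try:
        conn.execute("DROP TABLE IF EXISTS city_names")
        conn.execute("CREATE TABLE city_names (city TEXT, alternate_name TEXT)")
        with open(DATA_DIR / "city_names.tsv", newline="", encoding="utf-8") as fh:
            rows = csv.reader(fh, delimiter="\t")
            conn.executemany("INSERT INTO city_names VALUES (?, ?)", rows)
        conn.commit()
    finally:
        conn.close()
```

The "restore to original" button in the web UI would just call this function (or the equivalent Django management command) for the affected tables.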
Alternatively, this route may be easier, especially when upgrading an installation:
- Ship all the data as is done now.
- Let users modify the data and store the modified versions in the appropriate database.
- Let the code look in the database first, falling back to the text files if the appropriate data is not found (see the sketch after this list). Don't let the code modify the text files in any way.
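A rough sketch of that database-first lookup with a text-file fallback; `city_names`, `lookup_alternate_names`, and `text_index` (a dict built from the text files at start-up) are all placeholder names, not anything from the question:

```python
def lookup_alternate_names(city, db_conn, text_index):
    """Return alternate names for a city, preferring user edits stored in the
    database and falling back to the read-only, version-controlled text data."""
    rows = db_conn.execute(
        "SELECT alternate_name FROM city_names WHERE city = ?", (city,)
    ).fetchall()
    if rows:
        return [alt for (alt,) in rows]
    # Nothing in the DB: fall back to the text-file data loaded at start-up.
    # The code never writes to the text files.
    return text_index.get(city, [])
```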
In either case, you can keep your current testing code; you just need to add tests that make sure the database properly overrides text files.
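For example, a test along these lines, reusing the hypothetical `lookup_alternate_names` from the sketch above, checks both the override and the fallback:

```python
import sqlite3
import unittest


class OverrideTests(unittest.TestCase):
    def _empty_db(self):
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE city_names (city TEXT, alternate_name TEXT)")
        return conn

    def test_database_overrides_text_files(self):
        conn = self._empty_db()
        conn.execute("INSERT INTO city_names VALUES ('Rome', 'Roma')")
        text_index = {"Rome": ["Eternal City"]}
        self.assertEqual(lookup_alternate_names("Rome", conn, text_index), ["Roma"])

    def test_falls_back_to_text_files(self):
        conn = self._empty_db()
        text_index = {"Rome": ["Eternal City"]}
        self.assertEqual(
            lookup_alternate_names("Rome", conn, text_index), ["Eternal City"]
        )
```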
I'd term this a resource: data that your application relies on, but not data that your application manages. Images, CSS, and templates are similar resources, and you keep them all under version control.
In this case, you could split the data out into a separate package. In Python distribution terms, use a separate egg that your application depends on; package deployment tools such as pip and buildout will pull in the dependency automatically. That way you can version the data independently, other tools can depend on it, and so on.
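For illustration, a bare-bones `setup.py` for such a data-only distribution might look like this; the package and file names are invented, and the same idea works with a buildout recipe:

```python
# setup.py for a hypothetical, data-only distribution.
from setuptools import setup, find_packages

setup(
    name="myapp-algorithm-data",
    version="1.4.0",                     # bump whenever the data changes
    packages=find_packages(),
    include_package_data=True,
    # Ship the version-controlled text files inside the package.
    package_data={"myapp_algorithm_data": ["*.tsv", "*.sql"]},
)
```

The application then lists that package in its own dependencies (e.g. `install_requires`), so every new web instance gets a matching copy of the data when its dependencies are installed.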
Last, choose a format that can be managed effectively by a source control system, which means a textual format. You can parse that format on start-up; by keeping it textual you can manage changes to it properly. This could be a SQL dump (CREATE TABLE and INSERT statements) loaded into a SQLite database on start-up, or some other Python-based structure. You can also choose to load data on demand, caching it in memory as needed.
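A minimal sketch of that start-up step, assuming the dump lives at a made-up path `algorithm_data/dump.sql`:

```python
import sqlite3


def load_resource_db(dump_path="algorithm_data/dump.sql"):
    """Build an in-memory SQLite database from a version-controlled SQL dump."""
    conn = sqlite3.connect(":memory:")
    with open(dump_path, encoding="utf-8") as fh:
        # Runs the CREATE TABLE / INSERT statements from the dump.
        conn.executescript(fh.read())
    return conn
```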
Have a look at the pytz timezone database for a great example of how another project manages this kind of resource.