I'm building an application to import data into a SQL Server 2008 Express database.
This database is used by an application that is currently in production. The data that needs to be imported comes from various sources, mostly Excel sheets and XML files. The database has the following tables:
- tools
- powertools
- strikingtools
- owners
Each row or XML element in the source files holds information about one tool:
name, tooltype, weight, wattage, owner, material, etc. Each of these rows contains the name of the tool's owner; this name has to be inserted into the owners table, but only if it isn't already in there.
For each of these rows a new row needs to be inserted in the tools table. The tools table has an owner_id field with a foreign key to the owners table, which needs to be set to the primary key of the corresponding row in the owners table. Depending on the tooltype, a new row must also be created in either the powertools table or the strikingtools table. These two tables also have a tool_id field with a foreign key to the tools table that must be filled in. The tools table has a tool_owner_id field with a foreign key to the owners table that must be filled in. If any of the rows in the import file fails to import for some reason, the entire import needs to be rolled back.
Currently I'm using a DataSet to do this, but for some large files (over 200,000 tools) this requires quite a lot of memory. Can anybody think of a better approach for this?
There are two main issues to be solved:
- Parsing a large XML document efficiently.
- Adding a large amount of records to the database.
XML Parsing
Although the DataSet approach works, the whole XML document is loaded into memory. To improve efficiency when working with large XML documents, you might want to look at the XmlReader class. The API is slightly more difficult to use than what DataSet provides, but you get the benefit of not loading the whole DOM into memory at once.
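For example, a minimal sketch of streaming the file with XmlReader (the file name and the <tool> element name are assumptions about your schema, not something from your question):

```csharp
using System;
using System.Xml;

class ToolImporter
{
    static void Main()
    {
        // Stream through the document one node at a time instead of loading the whole DOM.
        using (XmlReader reader = XmlReader.Create("tools.xml"))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "tool")
                {
                    // ReadSubtree returns a reader scoped to this <tool> element,
                    // so each tool can be processed and then discarded immediately.
                    using (XmlReader toolReader = reader.ReadSubtree())
                    {
                        ProcessTool(toolReader);
                    }
                }
            }
        }
    }

    static void ProcessTool(XmlReader toolReader)
    {
        // Read name, tooltype, owner, etc. from the subtree and hand them to the DB layer.
    }
}
```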
Inserting records to the DB
To satisfy your atomicity requirement you can use a single database transaction, but a single transaction covering the large number of records you are dealing with is not ideal. You will most likely run into issues like:
- Database having to deal with a large number of locks
- Database locks that might escalate from row locks to page locks and even table locks.
- Concurrent use of the database will be severely affected during the import.
I would recommend the following instead of a single DB transaction:
- See if it is possible to create smaller transaction batches, maybe 100 records at a time. Perhaps sections of the XML file belong together logically, so that it would be acceptable to load a subset of the data as a unit into the system (see the sketch after this list).
- Validate as much of your data upfront as possible, e.g. check that required fields are filled in or that FKs are correct.
- Make the upload repeatable. Skip over existing data.
- Provide a manual undo strategy. I know this is easier said than done, but it might even be required as an additional business rule. For example, the upload was successful, but someone realises a couple of hours later that the wrong file was uploaded.
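A rough sketch of the batched-commit idea from the first bullet, using ADO.NET directly (the ToolRow type, table name and column names are placeholders, not your actual schema):

```csharp
using System;
using System.Collections.Generic;
using System.Data.SqlClient;

// Placeholder for one parsed tool row; the real importer would carry more fields.
class ToolRow
{
    public string Name;
    public int OwnerId;
}

static class BatchImporter
{
    // Insert rows in batches of 100; a failure only rolls back the current, uncommitted batch.
    public static void ImportInBatches(IEnumerable<ToolRow> rows, string connectionString)
    {
        const int batchSize = 100;
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            SqlTransaction transaction = connection.BeginTransaction();
            int count = 0;
            try
            {
                foreach (ToolRow row in rows)
                {
                    using (var cmd = new SqlCommand(
                        "INSERT INTO tools (name, tool_owner_id) VALUES (@name, @ownerId)",
                        connection, transaction))
                    {
                        cmd.Parameters.AddWithValue("@name", row.Name);
                        cmd.Parameters.AddWithValue("@ownerId", row.OwnerId);
                        cmd.ExecuteNonQuery();
                    }

                    if (++count % batchSize == 0)
                    {
                        transaction.Commit();                        // make this batch permanent
                        transaction = connection.BeginTransaction(); // start the next batch
                    }
                }
                transaction.Commit(); // commit the final, partial batch
            }
            catch
            {
                transaction.Rollback(); // undo only the current batch
                throw;
            }
        }
    }
}
```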
It might be useful to upload your data to an initial staging area in your DB to perform validations and to mark which records have been processed.
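For that staging step, a minimal SqlBulkCopy sketch could look like this (tools_staging is an assumed table name; validation and FK resolution would then run in set-based SQL against it):

```csharp
using System.Data;
using System.Data.SqlClient;

static class StagingLoader
{
    // Bulk-load parsed rows into a staging table before touching the production tables.
    public static void BulkLoadStaging(DataTable stagingRows, string connectionString)
    {
        using (var bulkCopy = new SqlBulkCopy(connectionString))
        {
            bulkCopy.DestinationTableName = "tools_staging";
            bulkCopy.BatchSize = 5000;           // stream rows to the server in chunks
            bulkCopy.WriteToServer(stagingRows); // columns map by ordinal unless mappings are added
        }
    }
}
```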
Use SSIS and create an ETL package.
Use transactions for the rollback feature, and stored procedures that handle creating/checking the foreign keys.
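A hedged sketch of what calling such a stored procedure from ADO.NET might look like (usp_ImportTool and its parameters are hypothetical; the actual procedure would encapsulate the owner lookup and FK handling, and the caller would wrap the loop of calls in one transaction for all-or-nothing behaviour):

```csharp
using System.Data;
using System.Data.SqlClient;

static class ProcedureImporter
{
    // Calls a stored procedure that looks up or inserts the owner, inserts the tool,
    // and fills in the powertools/strikingtools row inside the caller's transaction.
    public static void ImportTool(SqlConnection connection, SqlTransaction transaction,
                                  string name, string ownerName, string toolType)
    {
        using (var cmd = new SqlCommand("usp_ImportTool", connection, transaction))
        {
            cmd.CommandType = CommandType.StoredProcedure;
            cmd.Parameters.AddWithValue("@name", name);
            cmd.Parameters.AddWithValue("@ownerName", ownerName);
            cmd.Parameters.AddWithValue("@toolType", toolType);
            cmd.ExecuteNonQuery();
        }
    }
}
```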