I have a stack of data in 200,000+ XML files, which are updated or newly created every week. So, each week I have to parse each XML file, check whether any new XML files have been created, and then update my database with all the updated data. If there is a new XML file, my database won't have that record yet, so I have to create a new row.
Here's my workflow plan:
- Put all table X row IDs into array A. Let's say 200,000 numerical values.
- Parse each XML file and gather each file's ID (the ID is the same as my table X ID), and store them in array B. Let's say I now have 200,010 numerical values, with 10 new records compared to my current table X.
- Compare array A and array B to see which values do not exist.
- Put the 10 new values into array C.
- Create new records in table X for the 10 new IDs from array C.
- Parse each XML file again, and store the desired values in the corresponding table X rows and columns.
- Now my table X will have 200,010 records, each of them updated, and the 10 new records will also be in table X.
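In Ruby, this plan would look roughly like the sketch below (nokogiri and mysql2 are only examples; the table name, column names, and XPath are assumptions):

```ruby
require 'nokogiri'   # assumed XML parser
require 'mysql2'     # assumed database client; adjust for your setup

db = Mysql2::Client.new(host: 'localhost', username: 'user', database: 'mydb')

# Step 1: all existing row IDs from table X (array A)
existing_ids = db.query('SELECT id FROM table_x').map { |row| row['id'] }

# Step 2: gather the ID from every XML file (array B)
xml_ids = Dir.glob('xml_files/*.xml').map do |path|
  doc = Nokogiri::XML(File.read(path))
  doc.at_xpath('//record/@id').value.to_i   # assumed location of the ID
end

# Steps 3-5: IDs present in the files but not yet in the table (array C)
new_ids = xml_ids - existing_ids
insert = db.prepare('INSERT INTO table_x (id) VALUES (?)')
new_ids.each { |id| insert.execute(id) }

# Step 6 would then re-parse each file and update the row's columns.
```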
The reason I have to do it this way is that I can't get any information about which XML files the vendor has newly created. They just give me a stack of files.
Is there a better way to do it? I'm worried that my system will crash when it compares two arrays with 200,000+ values. Thanks.
I had to do something similar. In the end I did it like this:
Baseline the whole setup by iterating over each record in all XML files, normalizing the item (removing newlines, cleaning up whitespace, substituting certain characters) and then computing a per-record MD5 sum. Also import the record.
When I get new data, I iterate through the records (SAX parsing would be a good idea); if a record is not yet in the DB (based on its UID) or has changed (based on its MD5 sum), it gets imported.
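A minimal Ruby sketch of that per-record check, assuming an sqlite3 database with a `records` table holding `uid`, `md5`, and `data` columns, and a normalization step roughly like the one described above:

```ruby
require 'digest'
require 'sqlite3'   # any database works; sqlite3 is just an assumption here

# Normalize a record the same way it was normalized at baseline time:
# strip newlines and collapse whitespace.
def normalize(raw)
  raw.gsub(/[\r\n]+/, ' ').gsub(/\s+/, ' ').strip
end

# Import a record only if it is new (unknown UID) or has changed (different MD5).
def import_if_changed(db, uid, raw_record)
  checksum = Digest::MD5.hexdigest(normalize(raw_record))
  existing = db.get_first_value('SELECT md5 FROM records WHERE uid = ?', uid)

  if existing.nil?
    db.execute('INSERT INTO records (uid, md5, data) VALUES (?, ?, ?)',
               [uid, checksum, raw_record])
  elsif existing != checksum
    db.execute('UPDATE records SET md5 = ?, data = ? WHERE uid = ?',
               [checksum, raw_record, uid])
  end
end
```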
This works pretty well for what we usually need it for (around 350k records spread across around 100 files), and it also worked OK-ish with (much) more data. It's a wild mix of several tools, including Bash, AWK, sed, grep, the wonderful XMLStarlet, and Ruby, and it is in dire need of a proper rewrite.
Don't compare the files themselves in your Ruby script; use MD5 sums. That works unless the content of the XML has changed, and in that case it isn't the same XML at all.
A few suggestions:
- If the XML files are big, use a SAX parser, so you don't need to keep the whole file in memory.
- Sort array A so you can use binary search.
- Read each XML file in turn; if the ID from the file is not in array A, store the record in your table X (no need to keep a second array B). A sketch of this approach follows below.
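Here is a rough Ruby sketch of those three suggestions combined, using Nokogiri's SAX parser and `Array#bsearch`; the element/attribute names and the `load_ids_from_table_x` / `insert_into_table_x` helpers are placeholders:

```ruby
require 'nokogiri'

# SAX handler that only picks up the record ID, so the whole file never has
# to be held in memory. The element and attribute names are assumptions.
class IdCollector < Nokogiri::XML::SAX::Document
  attr_reader :id

  def start_element(name, attrs = [])
    return unless name == 'record' && @id.nil?
    @id = Hash[attrs]['id'].to_i
  end
end

existing_ids = load_ids_from_table_x.sort   # placeholder: all IDs from table X, sorted

Dir.glob('xml_files/*.xml').each do |path|
  handler = IdCollector.new
  Nokogiri::XML::SAX::Parser.new(handler).parse(File.open(path))

  # Binary search over the sorted array: O(log n) per lookup.
  found = existing_ids.bsearch { |x| x >= handler.id } == handler.id
  insert_into_table_x(path, handler.id) unless found   # placeholder helper
end
```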
Looks like a pretty good solution. A little change will make it faster.
In step 3 you "Compare array A and array B to see which values do not exist." I assume you are iterating through array B (XML) and looking up values in array A (database).
When you find a value in array A, be sure to remove it from array A (database), making array A smaller and smaller as you go through the loop.
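A small Ruby sketch of that shrinking-array idea (variable names are hypothetical, with placeholder data):

```ruby
# ids_in_db: all IDs currently in table X; ids_in_xml: IDs gathered from the files.
ids_in_db  = [1, 2, 3, 5, 8]          # placeholder data
ids_in_xml = [1, 2, 3, 4, 5, 8, 9]    # placeholder data
new_ids    = []

ids_in_xml.each do |id|
  if ids_in_db.delete(id)             # removes the match, shrinking the array
    # already in the database -- nothing to create
  else
    new_ids << id                     # not found, so this is a new record
  end
end

# new_ids now holds [4, 9]: the records to insert into table X
```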
Assuming that you ignore XML files with existing IDs, rather than importing any changes in them (your question implies that existing files don't change, or that you don't care about changes), a slight modification of your procedure will be optimal:
- Load all the existing IDs from your database into a set, rather than an array.
- Parse each XML file, extracting its ID, and check if it's in the set.
- If its ID is not in the set, process the file and add it to the database.
There's no need to do two passes over your data, because you can check if the file has already been processed, then immediately process it if it hasn't.
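A minimal Ruby sketch of this single-pass approach, using a `Set` for the ID lookups; the XPath and the `load_ids_from_table_x` / `import_record` helpers are assumptions:

```ruby
require 'set'
require 'nokogiri'

# Load every existing ID once, into a Set for O(1) membership checks.
existing_ids = Set.new(load_ids_from_table_x)    # placeholder helper

Dir.glob('xml_files/*.xml').each do |path|
  doc = Nokogiri::XML(File.read(path))
  id  = doc.at_xpath('//record/@id').value.to_i  # assumed location of the ID

  next if existing_ids.include?(id)              # already in table X

  import_record(doc, id)                         # placeholder helper: insert the new row
  existing_ids << id                             # remember it for the rest of this run
end
```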