I have a PostgreSQL database with a fixed structure and several million XML files. I have to parse each file, extract certain data, and fill the tables in the database. What I want to know is the best language/framework/algorithm for this routine.
I wrote a program in C# (Mono) using the DbLinq ORM. It does not use threading; it just parses file by file, fills table objects, and submits a group of objects (for example 200) to the database. It appears to be rather slow: it processes about 400 files per minute and would take about a month to finish the job. I ask for your thoughts and tips.

I think it would be faster if you used small programs in a pipe that will:
join your files into one big stream;
parse the input stream and generate an output stream in PostgreSQL COPY format (the same format pg_dump uses when creating backups), similar to tab-separated values, which looks like this:
COPY table_name (table_id, table_value) FROM stdin;
1	value1
2	value2
3	value3
\.
load the COPY stream into PostgreSQL started temporarily with the "-F" option to disable fsync calls.
For example on Linux:
find -name \*.xml -print0 | xargs -0 cat \
| parse_program_generating_copy \
| psql dbname
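A minimal sketch of what parse_program_generating_copy could look like in Python; it is an assumption, not the answer's actual program. This variant reads NUL-separated file names (as produced by find -print0) instead of a concatenated cat stream, and the <record>/<id>/<value> element names are hypothetical placeholders for whatever your XML actually contains:

#!/usr/bin/env python3
# Hypothetical sketch of parse_program_generating_copy: reads NUL-separated
# file names from stdin, extracts fields from each XML file, and writes
# PostgreSQL COPY text format to stdout.
import sys
import xml.etree.ElementTree as ET

def copy_escape(s):
    # Escape characters that are special in COPY text format.
    return (s.replace('\\', '\\\\')
             .replace('\t', '\\t')
             .replace('\n', '\\n')
             .replace('\r', '\\r'))

def main():
    out = sys.stdout
    out.write("COPY table_name (table_id, table_value) FROM stdin;\n")
    for name in sys.stdin.buffer.read().split(b'\0'):
        if not name:
            continue
        try:
            root = ET.parse(name.decode()).getroot()
        except ET.ParseError as e:
            print(f"skipping {name!r}: {e}", file=sys.stderr)
            continue
        for rec in root.iter('record'):       # assumed element names
            row_id = rec.findtext('id', default='')
            value = rec.findtext('value', default='')
            out.write(f"{copy_escape(row_id)}\t{copy_escape(value)}\n")
    out.write("\\.\n")

if __name__ == '__main__':
    main()

It would be used like `find . -name '*.xml' -print0 | python3 parse_to_copy.py | psql dbname`, i.e. the parser itself opens the files rather than receiving their concatenated contents through cat.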
Using COPY is much faster than inserting with an ORM. Joining the files will parallelise reading and writing to the database. Disabling "fsync" will allow a big speedup, but will require restoring the database from a backup if the server crashes during loading.
Generally I believe Perl is a good option for parsing tasks, though I do not know Perl myself. It sounds to me as if your performance demands are so extreme that you might need to write your own XML parser, as a standard one might become the bottleneck (you should test this before you start implementing). I myself use Python and psycopg2 to communicate with Postgres.
Whichever language you choose, you certainly want to use COPY FROM, probably feeding the data over stdin from Perl/Python/whatever language you pick, to get it into Postgres.
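To make that concrete, here is a minimal psycopg2 sketch (not the poster's code; the table, column names, and connection string are made up) that streams rows through COPY ... FROM STDIN instead of issuing ORM inserts:

# Sketch: load pre-built rows into PostgreSQL via COPY FROM STDIN with psycopg2.
# "table_name", "table_id", "table_value" and the DSN are assumptions; real
# values must also be escaped for COPY text format (tabs, newlines, backslashes).
import io
import psycopg2

rows = [(1, "value1"), (2, "value2"), (3, "value3")]  # would come from the XML parser

buf = io.StringIO()
for row_id, value in rows:
    buf.write(f"{row_id}\t{value}\n")
buf.seek(0)

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    cur.copy_expert("COPY table_name (table_id, table_value) FROM STDIN", buf)
conn.close()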
Instead of spending a lot of time optimizing everything, you could also use a suboptimal solution and run it in parallel on, say, 100 EC2 instances. This would be a lot cheaper than spending hours and hours on finding the optimal solution.
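The same idea scaled down to one machine: a rough Python sketch that parallelises the existing, unoptimised per-file parsing with multiprocessing. parse_one_file and the element names are hypothetical placeholders for whatever per-file work your program already does.

# Rough sketch: run the per-file parsing across CPU cores with multiprocessing
# instead of optimising each step. The element names are assumptions.
from multiprocessing import Pool
from pathlib import Path
import xml.etree.ElementTree as ET

def parse_one_file(path):
    root = ET.parse(path).getroot()
    return [(rec.findtext('id'), rec.findtext('value'))
            for rec in root.iter('record')]

if __name__ == '__main__':
    files = [str(p) for p in Path('.').rglob('*.xml')]
    with Pool() as pool:
        for rows in pool.imap_unordered(parse_one_file, files, chunksize=100):
            # hand each batch of rows to the COPY loader here
            pass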
Without knowing anything about the size of the files, 400 files per minute does not sound TOO bad. Ask yourself whether it is worth spending a week of development to cut the time to a third, or just run it now and wait a month.