I have 2 big CSV file with millions of rows. Because those 2 CSVs are from MySQL, I want to merge those 2 tables into one Document in couch DB.
What is the most efficient way to do this?
My current method is:
- import 1st CSV
- import 2nd CSV
To prevent duplication, the program will search the Document with the key for each row. after the row is found, then the Document is updated with the columns from the 2开发者_如何学编程nd CSV
The problem is, it really takes a long time to search for each row. while importing the 2nd CSV, it updates 30 documents/second and I have about 7 million rows. rough calculation, it will take about 64 hours to complete the whole importing.
Thank You
It sounds like you have a "primary key" that you know from the row (or you can compute it from the row). That is ideal as the document _id
.
The problem is, you get 409 Conflict
if you try to add the 2nd CSV data but there was already a document with the same _id
. Is that correct? (If so, please correct me so I can fix the answer.)
I think there is a good answer for you:
Use _bulk_docs to import everything, then fix the conflicts.
Begin with a clean database.
Use the Bulk docuent API to insert all the rows from the 1st and then 2nd CSV set—as many as possible per HTTP query, e.g. 1,000 at a time. (Bulk docs are much faster than inserting one-by-one.)
Always add "all_or_nothing": true
in your _bulk_docs
POST data. That will guarantee that every insertion will be successful (assuming no disasters such as power loss or full HD).
When you are done, some documents will be conflicted, which means you inserted twice for the same _id
value. That is no problem. Simply follow this procedure to merge the two versions:
- For each
_id
that has conflicts, fetch it from couch byGET /db/the_doc_id?conflicts=true
. - Merge all the values from the conflicting versions into a new final version of the document.
- Commit the final, merged document into CouchDB and delete the conflicted revisions. See the CouchDB Definitive Guide section on conflict resolution. (You can use
_bulk_docs
to speed this up too.)
Example
Hopefully this will clarify a bit. Note, I installed the *manage_couchdb* couchapp from http://github.com/iriscouch/manage_couchdb. It has a simple view to show conflicts.
$ curl -XPUT -Hcontent-type:application/json localhost:5984/db
{"ok":true}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "first_value": "This is the first value"
}
, { "_id": "some_id"
, "second_value": "The second value is here"
}
]
}
[{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"},{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"some_id","key":["some_id","1-0cb8fd1fd7801b94bcd2f365ce4812ba"],"value":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba"},"doc":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba","first_value":"This is the first value"}},
{"id":"some_id","key":["some_id","1-d1b74e67eee657f42e27614613936993"],"value":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993"},"doc":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993","second_value":"The second value is here"}}
]}
$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
, "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba"
, "first_value": "This is the first value"
, "second_value": "The second value is here"
}
, { "_id": "some_id"
, "_rev": "1-d1b74e67eee657f42e27614613936993"
, "_deleted": true
}
]
}
[{"id":"some_id","rev":"2-df5b9dc55e40805d7f74d1675af29c1a"},{"id":"some_id","rev":"2-123aab97613f9b621e154c1d5aa1371b"}]
$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":0,"offset":0,"rows":[]}
$ curl localhost:5984/db/some_id?conflicts=true\&include_docs=true
{"_id":"some_id","_rev":"2-df5b9dc55e40805d7f74d1675af29c1a","first_value":"This is the first value","second_value":"The second value is here"}
The final two commands show that there are no conflicts, and the "merged" document is now served as "some_id".
Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.
For each batch of documents:
POST to
/db/_all_docs?include_docs=true
with a body like this:{ "keys": [ "some_id_1" , "some_id_2" , "some_id_3" ] }
Build your
_bulk_docs
update depending on the results you get.- Doc already exists, you must update it:
{"key":"some_id_1", "doc": {"existing":"data"}}
- Doc does not exist, you must create it:
{"key":"some_id_2", "error":"not_found"}
- Doc already exists, you must update it:
POST to
/db/_bulk_docs
with a body like this:{ "docs": [ { "_id": "some_id_1" , "_rev": "the _rev from the previous query" , "existing": "data" , "perhaps": "some more data I merged in" } , { "_id": "some_id_2" , "brand": "new data, since this is the first doc creation" } ] }
精彩评论