I am running a sharded MongoDB environment - 3 mongod shards, 1 mongod config server, 1 mongos (no replication).
I want to use mongoimport to import CSV data into the database. I have 105 million records split across 210 CSV files of 500,000 records each. I understand that mongoimport is single-threaded, and I read that I should run multiple mongoimport processes to get better performance. However, I tried that and didn't get a speedup:
with 3 mongoimports running in parallel, I was getting ~6k inserts/sec per process (~18k inserts/sec total), whereas a single mongoimport gave me ~20k inserts/sec.
Since these processes were all routed through the single mongod config server and mongos, I am wondering if this is due to my cluster configuration. My question is: if I set up my cluster differently, will I achieve better mongoimport speeds? Do I want more mongos processes? How many mongoimport processes should I fire off at a time?
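For reference, the parallel run looked roughly like this (a sketch only; the host, database, collection, and file names are placeholders for my actual setup):

```
# All three imports target the same mongos, so every write funnels through it.
mongoimport --host mongos-host --port 27017 --db mydb --collection records \
    --type csv --headerline --file part-001.csv &
mongoimport --host mongos-host --port 27017 --db mydb --collection records \
    --type csv --headerline --file part-002.csv &
mongoimport --host mongos-host --port 27017 --db mydb --collection records \
    --type csv --headerline --file part-003.csv &
wait   # block until all three imports finish
```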
So, the first thing you need to do is "pre-split" your chunks.
Let's assume that you have already sharded the collection to which you're importing. When you start "from scratch", all of the data will start going to a single node. As that node fills up, MongoDB will start "splitting" its data into chunks. Once the shard holds around 8 chunks (roughly 8 x 64MB of data at the default chunk size), the balancer will start migrating chunks to the other shards.
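For context, sharding the target collection looks something like this (a sketch; the database name, collection name, and shard key are assumptions, not taken from your setup):

```
# Run against the mongos; "mydb.records" and the shard key "user_id" are hypothetical.
mongo mongos-host:27017/admin --eval '
  db.runCommand({ enableSharding: "mydb" });
  db.runCommand({ shardCollection: "mydb.records", key: { user_id: 1 } });
'
```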
So in effect, you're writing to a single node, and that node is slowed down further because it also has to read its data back and migrate it to the other nodes.
This is why you're not seeing any speedup with 3 mongoimport processes: all of the data is still going to a single node, and you're maxing out that node's throughput.
The trick here is to "pre-split" the data. In your case, you would pre-create and distribute the chunk ranges so that roughly 70 files' worth of data lands on each of your three shards. Then you can import those files on different threads and get much better throughput.
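A minimal sketch of manual pre-splitting, assuming the hypothetical collection and shard key from above, a numeric key covering records 0 to 105 million, and the default shard names (adapt the split points to your actual key distribution):

```
# Create one split point per input file (every 500,000 records), then spread
# the resulting chunks evenly across the three shards before importing.
mongo mongos-host:27017/admin --eval '
  for (var i = 1; i < 210; i++) {
    db.runCommand({ split: "mydb.records", middle: { user_id: i * 500000 } });
  }
  var shards = ["shard0000", "shard0001", "shard0002"];
  for (var i = 0; i < 210; i++) {
    db.runCommand({ moveChunk: "mydb.records",
                    find: { user_id: i * 500000 },
                    to: shards[i % 3] });
  }
'
```

With the chunks already in place and spread out, parallel imports write to all three shards at once instead of queuing up on one.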
Jeremy Zawodny of Craigslist has a reasonable write-up on this here. The MongoDB site has some docs here.
I've found a few things that help with bulk loads:
1. Defer building indexes (except the one you need on the shard key) until after you've loaded everything; see the sketch after this list.
2. Run one mongos and one mongoimport per shard and load in parallel.
3. And the biggest improvement: pre-split your chunks. This is a bit tricky, since you need to figure out how many chunks you will need and roughly how the data is distributed across the shard key. After you split them, you have to wait for the balancer to move them all around.
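For the index-deferral point above, a minimal sketch (the collection and field names are hypothetical; ensureIndex was the shell method of this era):

```
# Build secondary indexes once, after the bulk load, instead of paying
# the per-insert index-maintenance cost 105 million times.
mongo mongos-host:27017/mydb --eval '
  db.records.ensureIndex({ created_at: 1 });
  db.records.ensureIndex({ email: 1 });
'
```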