In short: There is a huge difference in insert speed depending on whether I keep a JSON object with many fields in a single string field of MongoDB, or keep each field of the JSON object in its own MongoDB field. 1) Is this difference normal? 2) Are these insertion speeds typical?
I have many records, each with a unique string id and 600 integer values. They are already represented as JSON objects in a file, one document per line. If I represent a MongoDB document as 600 separate integer fields and put my unique id into MongoDB's _id field, I can insert around 50 documents per second. If I instead create a document with only two fields (_id for the unique string id, and val as a single string holding the entire JSON line of the record), I can insert around 100 documents per second.
I am using the Python client and have tried batch inserts (e.g., 10, 100, 1000 at a time). The difference is always there. Is this behavior expected? I naively assumed I wouldn't see a difference, because MongoDB itself keeps the records as BSON, and there really shouldn't be much difference between having 600 fields, each holding an integer, and a single string field containing a JSON record that in turn holds 600 integers.
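For illustration, a minimal sketch of the two layouts and the batching with the 2.x-era Python driver (the client setup, the "id" key, the file name, and the batch size are placeholders, not the actual benchmark code):

import json
from pymongo import Connection  # pymongo 2.x API; newer versions use MongoClient

db = Connection()["tmp"]

def flat_doc(line):
    # Slow case: ~600 integer fields, with the unique string id moved into _id.
    rec = json.loads(line)
    rec["_id"] = rec.pop("id")
    return rec

def blob_doc(line):
    # Fast case: only two fields, _id and the whole JSON line as one string.
    return {"_id": json.loads(line)["id"], "val": line.rstrip("\n")}

with open("records.json") as fh:
    docs = [flat_doc(line) for line in fh]   # or blob_doc(line) for the fast case

for i in range(0, len(docs), 1000):
    db.test.insert(docs[i:i + 1000])         # batch insert, 1000 documents at a time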
Addendum: 1) I do the JSON-to-dictionary conversion (i.e., json.loads and related work) in both cases, to make sure it does not affect the speed measurement. In other words, in the single-field-with-JSON-string case I do everything I do in the other case, but discard the converted dictionary.
2) I also tried a dry run: everything intact, but without any inserts to MongoDB. I can process around 700-800 lines per second.
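A rough sketch of that dry-run measurement, timing the per-line processing with the insert removed (the file name and helper names are placeholders, not the original benchmark code):

import json
import time

def lines_per_second(process, lines):
    start = time.time()
    for line in lines:
        process(line)
    return len(lines) / (time.time() - start)

with open("records.json") as fh:
    lines = fh.read().splitlines()

# Parsing alone runs at roughly 700-800 lines/s in this setup.
print("dry run: %.0f lines/s" % lines_per_second(json.loads, lines))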
3)
a. db.test.stats() in the single-string-field case (i.e., the fast case):
{
    "ns" : "tmp.test",
    "count" : 7999,
    "size" : 71262392,
    "avgObjSize" : 8908.91261407676,
    "storageSize" : 88751616,
    "numExtents" : 9,
    "nindexes" : 1,
    "lastExtentSize" : 21742848,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 466944,
    "indexSizes" : {
        "_id_" : 466944
    },
    "ok" : 1
}
b. db.test.stats() in the each-value-in-its-own-field case (i.e., the slow case):
{
    "ns" : "tmp.test",
    "count" : 7999,
    "size" : 85710500,
    "avgObjSize" : 10715.15189398675,
    "storageSize" : 107561984,
    "numExtents" : 9,
    "nindexes" : 1,
    "lastExtentSize" : 26091264,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 466944,
    "indexSizes" : {
        "_id_" : 466944
    },
    "ok" : 1
}
If possible, enable the C extensions, as they will provide a significant performance improvement. I think the difference in speed is due to the larger number of keys that must be serialized (by pure-Python code, since you have the extensions disabled) into the BSON document. With the C extensions enabled, this will be much faster (but must still be done), so I suspect you will still see a (very slight) difference in speed between the two approaches.
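One way to convince yourself that the extra cost lives in client-side BSON encoding, not on the server, is to time the encoding alone. A sketch using the bson package that ships with pymongo (BSON.encode is the 2.x-era API; the field names and values below are made up to mirror the 600-integer case):

import timeit
from bson import BSON

many_fields = dict(("f%d" % i, i) for i in range(600))
many_fields["_id"] = "some-unique-id"

two_fields = {"_id": "some-unique-id", "val": str(many_fields)}

print(timeit.timeit(lambda: BSON.encode(many_fields), number=1000))
print(timeit.timeit(lambda: BSON.encode(two_fields), number=1000))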
Edit: Note that when I say "enable the C extensions," I mean re-build pymongo, or use a pre-built binary for your platform that has the C modules built. You can see the available binary packages at http://pypi.python.org/pypi/pymongo/2.0.1#downloads
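If you are unsure whether the extensions are actually compiled into your install, both the pymongo and bson packages expose a has_c() helper:

import bson
import pymongo

print(pymongo.has_c())  # True if pymongo's C extension is available
print(bson.has_c())     # True if the C extension for BSON encoding/decoding is available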