In short: There is a huge difference in insert speed depending on whether I keep a JSON object with many fields in a single string field of MongoDB, or keep each field of the JSON object in its own MongoDB field. 1) Is this difference normal? 2) Are these insertion speeds typical?
I have many records, each with a unique string id and 600 integer values. They are already represented as JSON objects in a file, one document per line. If I represent a MongoDB document as 600 separate integer fields and put my unique id into MongoDB's _id field, I can insert around 50 documents per second. If I instead create a document with only two fields (_id for the unique string id, and val as a single string holding the entire JSON line of the record), I can insert around 100 documents per second.
I am using the Python client and have tried batch inserts (e.g., 10, 100, 1000 at a time). The difference is always there. Is this behavior expected? I naively assumed I wouldn't see a difference, because MongoDB itself keeps the records as BSON, and there really shouldn't be much difference between having 600 fields, each holding an integer, and a single string field containing a JSON record that in turn holds 600 integers.
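For illustration, a minimal sketch of the two layouts and the batching with the 2.x-era Python driver (the client setup, the "id" key, the file name, and the batch size are placeholders, not the actual benchmark code):

import json
from pymongo import Connection  # pymongo 2.x API; newer versions use MongoClient

db = Connection()["tmp"]

def flat_doc(line):
    # Slow case: ~600 integer fields, with the unique string id moved into _id.
    rec = json.loads(line)
    rec["_id"] = rec.pop("id")
    return rec

def blob_doc(line):
    # Fast case: only two fields, _id and the whole JSON line as one string.
    return {"_id": json.loads(line)["id"], "val": line.rstrip("\n")}

with open("records.json") as fh:
    docs = [flat_doc(line) for line in fh]   # or blob_doc(line) for the fast case

for i in range(0, len(docs), 1000):
    db.test.insert(docs[i:i + 1000])         # batch insert, 1000 documents at a time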
Addendum: 1) I do the JSON-to-dictionary conversion (i.e., json.loads and related work) in both cases, to make sure it does not affect the speed measurement. In other words, in the single-field-with-JSON-string case I do everything I do in the other case, but discard the converted dictionary.
2) I also tried a dry run: everything intact, but without any inserts to MongoDB. I can process around 700-800 lines per second.
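A rough sketch of that dry-run measurement, timing the per-line processing with the insert removed (the file name and helper names are placeholders, not the original benchmark code):

import json
import time

def lines_per_second(process, lines):
    start = time.time()
    for line in lines:
        process(line)
    return len(lines) / (time.time() - start)

with open("records.json") as fh:
    lines = fh.read().splitlines()

# Parsing alone runs at roughly 700-800 lines/s in this setup.
print("dry run: %.0f lines/s" % lines_per_second(json.loads, lines))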
3)
a. db.test.stats() in the single-string-field case (i.e., the fast case):
{
    "ns" : "tmp.test",
    "count" : 7999,
    "size" : 71262392,
    "avgObjSize" : 8908.91261407676,
    "storageSize" : 88751616,
    "numExtents" : 9,
    "nindexes" : 1,
    "lastExtentSize" : 21742848,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 466944,
    "indexSizes" : {
        "_id_" : 466944
    },
    "ok" : 1
}
b. db.test.stats() in the each-value-in-its-own-field case (i.e., the slow case):
{
    "ns" : "tmp.test",
    "count" : 7999,
    "size" : 85710500,
    "avgObjSize" : 10715.15189398675,
    "storageSize" : 107561984,
    "numExtents" : 9,
    "nindexes" : 1,
    "lastExtentSize" : 26091264,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 466944,
    "indexSizes" : {
        "_id_" : 466944
    },
    "ok" : 1
}
If possible, enable the C extensions, as they will provide a significant performance improvement. I think the difference in speed is due to the larger number of keys that must be serialized (by pure-Python code, since you have the extensions disabled) into the BSON document. With the C extensions enabled, this will be much faster (but must still be done), so I suspect you will still see a (very slight) difference in speed between the two approaches.
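One way to convince yourself that the extra cost lives in client-side BSON encoding, not on the server, is to time the encoding alone. A sketch using the bson package that ships with pymongo (BSON.encode is the 2.x-era API; the field names and values below are made up to mirror the 600-integer case):

import timeit
from bson import BSON

many_fields = dict(("f%d" % i, i) for i in range(600))
many_fields["_id"] = "some-unique-id"

two_fields = {"_id": "some-unique-id", "val": str(many_fields)}

print(timeit.timeit(lambda: BSON.encode(many_fields), number=1000))
print(timeit.timeit(lambda: BSON.encode(two_fields), number=1000))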
Edit: Note that when I say "enable the C extensions," I mean re-build pymongo, or use a pre-built binary for your platform that has the C modules built. You can see the available binary packages at http://pypi.python.org/pypi/pymongo/2.0.1#downloads
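If you are unsure whether the extensions are actually compiled into your install, both the pymongo and bson packages expose a has_c() helper:

import bson
import pymongo

print(pymongo.has_c())  # True if pymongo's C extension is available
print(bson.has_c())     # True if the C extension for BSON encoding/decoding is available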