I am using MongoDB as a temporary log store. The collection receives ~400,000 new rows an hour. Each row contains a UNIX timestamp and a JSON string.
Periodically I would like to copy the contents of the collection to a file on S3, creating a file for each hour containing ~400,000 rows (eg. today_10_11.log contains all the rows received between 10am and 11am). I need to do this copy while the collection is receiving inserts.
My question: what is the performance impact of having an index on the timestamp column on the 4开发者_StackOverflow社区00,000 hourly inserts verses the additional time it will take to query an hours worth of rows.
The application in question is using written in Ruby running on Heroku and using the MongoHQ plugin.
Mongo indexes the _id field by default, and the ObjectId already starts with a timestamp, so basically, Mongo is already indexing your collection by insertion time for you. So if you're using the Mongo defaults, you don't need to index a second timestamp field (or even add one).
To get the creation time of an object id in ruby:
ruby-1.9.2-p136 :001 > id = BSON::ObjectId.new
=> BSON::ObjectId('4d5205ed0de0696c7b000001')
ruby-1.9.2-p136 :002 > id.generation_time
=> 2011-02-09 03:11:41 UTC
To generate the object ids for a given time:
ruby-1.9.2-p136 :003 > past_id = BSON::ObjectId.from_time(1.week.ago)
=> BSON::ObjectId('4d48cb970000000000000000')
So, for example, if you wanted to load all docs inserted in the past week, you'd simply search for _ids greater than past_id and less than id. So, through the Ruby driver:
collection.find({:_id => {:$gt => past_id, :$lt => id}}).to_a
=> #... a big array of hashes.
You can, of course, also add a separate field for timestamps, and index it, but there's no point in taking that performance hit when Mongo's already doing the necessary work for you with its default _id field.
More information on object ids.
I have an application like yours, and currently it has 150 million log records. At 400k an hour, this DB will get large fast. 400k inserts an hour with indexing on timestamp will be much more worthwhile than doing an unindexed query. I have no problem with inserting tens of millions of records in an hour with indexed timestamp, yet if I do an unindexed query on timestamp it takes a couple of minutes on a 4 server shard (cpu bound). Indexed query comes up instantly. So definitely index it, the write overhead on indexing is not that high and 400k records an hour is not much for mongo.
One thing you do have to look out for is memory size though. At 400k records an hour you are doing 10 million a day. That would consume about 350MB of memory a day to keep that index in memory. So if this goes for a while your index can get larger than memory fast.
Also, if you are truncating records after some time period using remove, I have found that removes create a large amount of IO to disk and it is disk bound.
Certainly on every write you will need to update the index data. If you're going to be doing large queries on the data you will definitely want an index.
Consider storing the timestamp in the _id field instead of a MongoDB ObjectId. As long as you are storing unique timestamps you'll be OK here. _id doesn't have to be an ObjectID, but has an automatic index on _id. This may be your best bet as you won't add an additional index burden.
I'd just use a capped collection, unindexed, with space for, say 600k rows, to allow for slush. Once per hour, dump the collection to a text file, then use grep to filter out rows that aren't from your target date. This doesn't let you leverage the nice bits of the DB, but it means you don't have to ever worry about collection indexes, flushes, or any of that nonsense. The performance-critical bit of it is keeping the collection free for inserts, so if you can do the "hard" bit (filtering by date) outside of the context of the DB, you shouldn't have any appreciable performance impact. 400-600k lines of text is trivial for grep, and likely shouldn't take more than a second or two.
If you don't mind a bit of slush in each log, you can just dump and gzip the collection. You'll get some older data in each dump, but unless you insert over 600k rows between dumps, you should have a continuous series of log snapshots of 600k rows apiece.
精彩评论