
MongoDB - Consuming the Tweets and counting data

https://www.devze.com 2023-03-08 23:46 Source: web
I am using the Twitter realtime streaming API to keep an active count of particular tracks. For example, I want to keep track of the number of times 'apple', 'orange' and 'pear' are tweeted. I'm using Mongo to store the tweet data but I have a question as to how is best to do get the count for each of the tracks I am following.

I will be running this query once every second to get a close-to-realtime count for each track, so I need to ensure I am doing it in the right way:

Option 1

Run a count query against a particular track

 db.tweets.count({track: 'apple'})

Considering the tweets collection will hold a LOT of data (potentially millions of documents), I wonder if this might be a bit slow?

Option 2

Create a second collection, 'track_count' and update a 'count' attribute each time a new tweet comes in:

{track:'apple', count:0}
{track:'orange', count:0}
{track:'pear', count:0}

Then when a new tweet comes in:

db.track_count.update( { track:"apple" }, { $inc: { count : 1 } } );

I can then keep an up-to-date count for each track, but it means writing to the database twice: once for the tweet and again to increment the track's count. Bear in mind there could be a fair number (tens, perhaps hundreds) of tweets coming in per second.
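The per-tweet logic of Option 2 can be sketched in plain JavaScript. This is an in-memory stand-in for the two Mongo writes, not driver code: `tweets` and `trackCounts` here are an ordinary array and object standing in for the `db.tweets` and `db.track_count` collections.

```javascript
// In-memory sketch of Option 2: store the tweet, then bump the counter.
const tweets = [];       // stands in for db.tweets
const trackCounts = {};  // stands in for db.track_count

function recordTweet(tweet) {
  // First write: persist the raw tweet (db.tweets.insert in Mongo).
  tweets.push(tweet);
  // Second write: increment the running total, mimicking
  // db.track_count.update({track: tweet.track}, {$inc: {count: 1}}).
  trackCounts[tweet.track] = (trackCounts[tweet.track] || 0) + 1;
}

recordTweet({track: 'apple', text: 'I like apples'});
recordTweet({track: 'apple', text: 'apple pie'});
recordTweet({track: 'pear', text: 'pear cider'});

console.log(trackCounts.apple); // 2
console.log(trackCounts.pear);  // 1
```

Reading the count then becomes a lookup of a single small document rather than a scan of the tweets collection.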

Does anyone have any suggestions as to the best method to do this?


Without doubt, use a separate track_count collection to keep a running total of the number of matches. Otherwise you'll be re-querying your entire tweets collection every second, which will become very slow and expensive as data volume grows.

Don't worry about writing to the database twice, once to store the tweet, then again to increment the counter. Writes in MongoDB are extremely fast, and this solution will scale well beyond thousands of tweets per second, even on a single non-clustered Mongo instance.


Does anyone have any suggestions as to the best method to do this?

There is no "best" method here; this is a classic trade-off. You can maintain counters, you can accept slow queries, or you can run regular map-reduce jobs.

  • Two writes => faster queries, more write activity
  • One write => slower queries, less write activity
  • Hourly M/R => slightly stale data, slightly more writes

Typically the suggestion is to use counters. MongoDB tends to be pretty good at handling large write loads, especially this type of increment-heavy counter workload.

You don't get more speed unless you sacrifice something: disk, RAM, or CPU. So you'll have to select your trade-off based on your needs.


Side note: is the track name unique?

You may want to try the following:

{_id:'orange', count:0}
{_id:'pear', count:0}

Or for counts by day:

{_id:'orange_20110528', count:0}
{_id:'orange_20110529', count:0}
{_id:'pear_20110529', count:0}
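A small helper for building those date-keyed `_id` values might look like the following sketch. The `yyyymmdd` suffix format is taken from the example documents above; the function name and the use of UTC are assumptions for illustration.

```javascript
// Build a per-day counter _id such as 'orange_20110528' from a track
// name and a Date, matching the example documents shown above.
function dailyCountId(track, date) {
  const yyyy = date.getUTCFullYear();
  const mm = String(date.getUTCMonth() + 1).padStart(2, '0'); // months are 0-based
  const dd = String(date.getUTCDate()).padStart(2, '0');
  return track + '_' + yyyy + mm + dd;
}

console.log(dailyCountId('orange', new Date(Date.UTC(2011, 4, 28)))); // 'orange_20110528'
```

With `_id` values like these you also get the unique index on `_id` for free, instead of needing a separate index on a `track` field.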
