140 characters. How much memory would it take up?
I'm trying to calculate how many tweets the MongoDB on my EC2 Large instance can hold.
Twitter uses UTF-8 encoded messages.
UTF-8 code points can be up to four octets long, making the maximum message size 140 x 4 = 560 8-bit bytes.
This is, of course, just for the raw messages, excluding storage overhead, indexing and other storage-related padding.
Edit: Twitter successfully let me post the message:
™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™™
Yes, that's 140 trademark symbols, which are three octets each in UTF-8, so that single message takes 420 bytes.
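To check the arithmetic, here is a minimal Python sketch that measures the UTF-8 size of a few 140-character messages. The helper name `utf8_size` and the sample strings are made up for illustration, and the sketch does not check whether Twitter would accept each message.

```python
def utf8_size(text: str) -> int:
    """Return the number of bytes `text` occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

MAX_CHARS = 140
# Upper bound: every code point needs the UTF-8 maximum of four octets.
WORST_CASE_BYTES = MAX_CHARS * 4

ascii_tweet = "a" * MAX_CHARS               # 1 octet per character
trademark_tweet = "\u2122" * MAX_CHARS      # U+2122 TRADE MARK SIGN, 3 octets each
astral_tweet = "\U0001F600" * MAX_CHARS     # a code point outside the BMP, 4 octets each

print(utf8_size(ascii_tweet))      # 140
print(utf8_size(trademark_tweet))  # 420
print(utf8_size(astral_tweet))     # 560
print(WORST_CASE_BYTES)            # 560
```

So the raw message size depends heavily on the characters used, ranging from 140 bytes for plain ASCII to 560 bytes in the worst case.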
Back in September, an engineer at Twitter gave a presentation that suggested it's about 200 bytes per tweet.
Of course you still have to account for overhead for your own metadata and the database itself, but 200 bytes/record is probably a good place to start.
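As a rough sketch of how that estimate turns into a capacity figure, the snippet below divides the available storage by the per-record size. The function name and the 100-byte per-record overhead are placeholders, not measured MongoDB values; measure the real overhead (document structure, indexes, padding) on your own data.

```python
def estimate_tweet_capacity(storage_bytes: int,
                            bytes_per_tweet: int = 200,
                            overhead_per_record: int = 100) -> int:
    """Estimate how many tweets fit in `storage_bytes`.

    bytes_per_tweet: the ~200 bytes/tweet figure quoted above.
    overhead_per_record: ASSUMED placeholder for metadata, index and
    padding costs; replace it with a value measured on your database.
    """
    record_size = bytes_per_tweet + overhead_per_record
    return storage_bytes // record_size

ONE_GIB = 1024 ** 3

# Tweets per GiB with the assumed 100-byte overhead.
print(estimate_tweet_capacity(ONE_GIB))                          # 3579139
# Tweets per GiB if only the raw ~200 bytes are counted.
print(estimate_tweet_capacity(ONE_GIB, overhead_per_record=0))   # 5368709
```

Multiply the per-GiB figure by however much storage you are willing to dedicate to tweets on the instance.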
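Typically it's two bytes per character if you're storing Unicode as UTF-16 (UTF-8 takes one to four bytes per character), so that would mean 280 bytes max per tweet as long as every character is in the Basic Multilingual Plane.
Probably 284 bytes in memory (4-byte length prefix + length * 2). Inside the DB I cannot say, but probably 280 if the DB stores two bytes per character; with UTF-8 it depends on the characters used. You should also add some bytes of overhead for metadata, etc.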
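The 284-byte figure corresponds to a two-byte-per-character encoding such as UTF-16 plus a 4-byte length prefix. The sketch below builds such a buffer and compares it with the UTF-8 size of the same text. The layout (little-endian 32-bit byte count followed by UTF-16LE data) and the function name are assumptions for illustration, not the format of any particular runtime or database.

```python
import struct

def length_prefixed_utf16(text: str) -> bytes:
    """Encode `text` as UTF-16LE preceded by a 4-byte length prefix.

    The prefix holds the payload size in bytes (little-endian uint32).
    This layout is a hypothetical example, not a specific product's format.
    """
    payload = text.encode("utf-16-le")
    return struct.pack("<I", len(payload)) + payload

ascii_tweet = "a" * 140
trademark_tweet = "\u2122" * 140

# Two bytes per character plus the prefix: 4 + 140 * 2.
print(len(length_prefixed_utf16(ascii_tweet)))      # 284
print(len(length_prefixed_utf16(trademark_tweet)))  # 284

# The same messages stored as UTF-8 differ a lot in size.
print(len(ascii_tweet.encode("utf-8")))             # 140
print(len(trademark_tweet.encode("utf-8")))         # 420
```

The fixed two-bytes-per-character rule only holds for UTF-16 text within the Basic Multilingual Plane; with UTF-8 the size varies with the characters in the message.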
Potentially of interest:
http://mehack.com/map-of-a-twitter-status-object
Anatomy of a Twitter Status Object
Also more about twitter character encoding:
http://dev.twitter.com/pages/counting_characters
It's technically stored as UTF-8, and in practice, the slide deck from a Twitter engineer at http://www.slideshare.net/raffikrikorian/twitter-by-the-numbers gives the real figure:
140 characters, ~200 bytes