I have hourly logs like
user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined
I want to compress all the flat files down to one file. There are around 30 million users in the logs and I just want the latest log entry for each user.
My end result is I want to have a log look like
user1:added comment
user2:log out
user3:joined
Now my first attempt on a small scale was to just do a dict like
log['user1'] = "added comment"
Will a dict of 30 million key/value pairs have a giant memory footprint, or should I use something like SQLite to store them and then just dump the contents of the SQLite table back into a file?
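Roughly what I have in mind for the SQLite route (an untested sketch; the table, database and file names are made up):
import sqlite3

conn = sqlite3.connect('latest.db')
conn.execute('CREATE TABLE IF NOT EXISTS latest (user TEXT PRIMARY KEY, action TEXT)')

for line in open('hourly.log'):
    user, action = line.rstrip('\n').split(':', 1)
    # later lines simply replace earlier ones for the same user
    conn.execute('INSERT OR REPLACE INTO latest VALUES (?, ?)', (user, action))
conn.commit()

out = open('combined.log', 'w')
for user, action in conn.execute('SELECT user, action FROM latest'):
    out.write(user + ':' + action + '\n')
out.close()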
If you intern() each log entry then you'll use only one string object per distinct log entry, no matter how many times it shows up, which lowers memory usage a lot.
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True
You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:
s = set()

# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them... There are recipes out there
# for reading a file in reverse (see the sketch below).
lines = open('log').readlines()
lines.reverse()

for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)
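One of those recipes, roughly (a hand-rolled sketch in the same Python 2 style; it assumes the log lines are newline-terminated and patches up chunk boundaries with the leftover partial line):
def reverse_lines(path, bufsize=8192):
    # read the file from the end in fixed-size chunks, yielding complete
    # lines in reverse order so the whole file never sits in memory
    f = open(path, 'rb')
    f.seek(0, 2)                    # jump to end of file
    pos = f.tell()
    leftover = ''
    while pos > 0:
        size = min(bufsize, pos)
        pos -= size
        f.seek(pos)
        chunk = f.read(size) + leftover
        lines = chunk.split('\n')
        leftover = lines.pop(0)     # possibly a partial line; prepend it next time
        for line in reversed(lines):
            if line:
                yield line
    if leftover:
        yield leftover
    f.close()

s = set()
for line in reverse_lines('log'):
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)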
The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key to value mappings. They are stored on the disk so there is no huge memory impact. And you can store them as logs if you wish to.
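For example, a minimal sketch with anydbm (Python 2; in Python 3 the module is just dbm, and the file and log names here are made up):
import anydbm

db = anydbm.open('latest_actions.db', 'c')      # 'c' creates the file if needed
for line in open('hourly.log'):
    user, action = line.rstrip('\n').split(':', 1)
    db[user] = action                           # later entries overwrite earlier ones
db.close()

# dump the database back out as a flat file
db = anydbm.open('latest_actions.db', 'r')
out = open('combined.log', 'w')
for user in db.keys():
    out.write(user + ':' + db[user] + '\n')
out.close()
db.close()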
This sounds like the perfect kind of problem for a Map/Reduce solution. See, for example:
- http://en.wikipedia.org/wiki/MapReduce
- Hadoop
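To give a flavour of the shape without Hadoop, here is a rough sketch (hypothetical helpers, not a real MapReduce job): each hourly file maps to its own "latest action per user" dict, and the reduce step merges those dicts in chronological order so later files win. The per-file map steps are independent, which is what makes the problem parallelisable.
import glob

def map_logfile(path):
    # map step: latest action per user within one hourly file
    latest = {}
    for line in open(path):
        user, action = line.rstrip('\n').split(':', 1)
        latest[user] = action
    return latest

def reduce_latest(partials):
    # reduce step: merge per-file results; later files overwrite earlier ones
    merged = {}
    for partial in partials:        # must be supplied in chronological order
        merged.update(partial)
    return merged

# hypothetical naming scheme where sorting the names gives chronological order
hourly_files = sorted(glob.glob('log-*'))
result = reduce_latest(map_logfile(f) for f in hourly_files)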
It's pretty easy to mock up the data structure to see how much memory it would take.
Something like this, where you could change gen_string to generate data that approximates the real messages:
import random
from commands import getstatusoutput as gso

def gen_string():
    return str(random.random())

d = {}
for z in range(10**6):
    d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]
On a one gig netbook:
0.4 vim test.py
0.1 /bin/bash -c time python test.py
11.7 /usr/bin/python2.6 test.py
0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
0.0 grep test.py
real 0m26.325s
user 0m25.945s
sys 0m0.377s
... So it's using about 10% of 1 gig for the million records inserted above.
But it will also depend on how much redundancy there is in your data...
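For instance, to model that redundancy on the value side (a hypothetical tweak to the script above), you could swap gen_string out for something that picks from the handful of actions that actually occur:
import random

ACTIONS = ['joined', 'log out', 'added pic', 'added comment']

def gen_value():
    # random.choice returns a reference to one of these four existing strings,
    # so every dict value is shared rather than being a fresh random string
    return random.choice(ACTIONS)

# d[gen_string()] = gen_value()   # keys stay random, values are now shared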
Thanks to @Ignacio for intern() -
def procLog(logName, userDict):
    inf = open(logName, 'r')
    for ln in inf.readlines():
        name, act = ln.split(':')
        # intern() the action so repeated actions all share one string object;
        # act keeps its trailing newline, so it can be written back out verbatim
        userDict[name] = intern(act)
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName, 'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)
The question, then, is how much memory this will consume.
import random
import sys

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']

d = {}
t = []
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    # record (number of users so far, size of the dict object itself)
    t.append( (len(d), sys.getsizeof(d)) )
Plotting t (horizontal axis = number of user names, vertical axis = memory usage in bytes) gives a result that is... slightly weird. It looks like a dictionary preallocates quite a lot of memory, then doubles it every time it gets too full?
Anyway, 4 million users takes just under 100 MB of RAM, but the last reallocation happens at around 3 million users (when it was about 50 MB), so if the doubling holds you will need roughly 800 MB of RAM to process 24 to 48 million users.
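If you want to watch that resizing behaviour directly (a quick CPython 2.x check; the exact thresholds and sizes vary between versions), something like this prints a line each time sys.getsizeof(d) jumps:
import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(3000000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        print len(d), size
        last = size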