I have hourly logs like
user1:joined
user2:log out
user1:added pic
user1:added comment
user3:joined
I want to compress all the flat files down to one file. There are around 30 million users in the logs and I just want the latest log entry for each user.
My end result is I want to have a log look like
user1:added comment
user2:log out
user3:joined
Now my first attempt on a small scale was to just do a dict like
log['user1'] = "added comment"
Will a dict of 30 million key/value pairs have a giant memory footprint, or should I use something like SQLite to store them and then just dump the contents of the SQLite table back into a file?
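Roughly what I have in mind for the SQLite route (an untested sketch; the table, database and file names are made up):
import sqlite3

conn = sqlite3.connect('latest.db')
conn.execute('CREATE TABLE IF NOT EXISTS latest (user TEXT PRIMARY KEY, action TEXT)')

for line in open('hourly.log'):
    user, action = line.rstrip('\n').split(':', 1)
    # later lines simply replace earlier ones for the same user
    conn.execute('INSERT OR REPLACE INTO latest VALUES (?, ?)', (user, action))
conn.commit()

out = open('combined.log', 'w')
for user, action in conn.execute('SELECT user, action FROM latest'):
    out.write(user + ':' + action + '\n')
out.close()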
If you intern() each log entry then you'll use only one string object per distinct log entry, no matter how many times it shows up, which lowers memory usage a lot.
>>> a = 'foo'
>>> b = 'foo'
>>> a is b
True
>>> b = 'f' + ('oo',)[0]
>>> a is b
False
>>> a = intern('foo')
>>> b = intern('f' + ('oo',)[0])
>>> a is b
True
You could also process the log lines in reverse -- then use a set to keep track of which users you've seen:
s = set()

# note, this piece is inefficient in that I'm reading all the lines
# into memory in order to reverse them... There are recipes out there
# for reading a file in reverse (see the sketch below).
lines = open('log').readlines()
lines.reverse()

for line in lines:
    line = line.strip()
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)
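One of those recipes, roughly (a hand-rolled sketch in the same Python 2 style; it assumes the log lines are newline-terminated and patches up chunk boundaries with the leftover partial line):
def reverse_lines(path, bufsize=8192):
    # read the file from the end in fixed-size chunks, yielding complete
    # lines in reverse order so the whole file never sits in memory
    f = open(path, 'rb')
    f.seek(0, 2)                    # jump to end of file
    pos = f.tell()
    leftover = ''
    while pos > 0:
        size = min(bufsize, pos)
        pos -= size
        f.seek(pos)
        chunk = f.read(size) + leftover
        lines = chunk.split('\n')
        leftover = lines.pop(0)     # possibly a partial line; prepend it next time
        for line in reversed(lines):
            if line:
                yield line
    if leftover:
        yield leftover
    f.close()

s = set()
for line in reverse_lines('log'):
    user, op = line.split(':')
    if user not in s:
        print line
        s.add(user)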
The various dbm modules (dbm in Python 3, or anydbm, gdbm, dbhash, etc. in Python 2) let you create simple databases of key to value mappings. They are stored on the disk so there is no huge memory impact. And you can store them as logs if you wish to.
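For example, a minimal sketch with anydbm (Python 2; in Python 3 the module is just dbm, and the file and log names here are made up):
import anydbm

db = anydbm.open('latest_actions.db', 'c')      # 'c' creates the file if needed
for line in open('hourly.log'):
    user, action = line.rstrip('\n').split(':', 1)
    db[user] = action                           # later entries overwrite earlier ones
db.close()

# dump the database back out as a flat file
db = anydbm.open('latest_actions.db', 'r')
out = open('combined.log', 'w')
for user in db.keys():
    out.write(user + ':' + db[user] + '\n')
out.close()
db.close()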
This sounds like the perfect kind of problem for a Map/Reduce solution. See, for example:
- http://en.wikipedia.org/wiki/MapReduce
- Hadoop
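To give a flavour of the shape without Hadoop, here is a rough sketch (hypothetical helpers, not a real MapReduce job): each hourly file maps to its own "latest action per user" dict, and the reduce step merges those dicts in chronological order so later files win. The per-file map steps are independent, which is what makes the problem parallelisable.
import glob

def map_logfile(path):
    # map step: latest action per user within one hourly file
    latest = {}
    for line in open(path):
        user, action = line.rstrip('\n').split(':', 1)
        latest[user] = action
    return latest

def reduce_latest(partials):
    # reduce step: merge per-file results; later files overwrite earlier ones
    merged = {}
    for partial in partials:        # must be supplied in chronological order
        merged.update(partial)
    return merged

# hypothetical naming scheme where sorting the names gives chronological order
hourly_files = sorted(glob.glob('log-*'))
result = reduce_latest(map_logfile(f) for f in hourly_files)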
It's pretty easy to mock up the data structure to see how much memory it would take.
Something like this, where you could change gen_string to generate data that approximates the real messages:
import random
from commands import getstatusoutput as gso

def gen_string():
    return str(random.random())

d = {}
for z in range(10**6):
    d[gen_string()] = gen_string()

print gso('ps -eo %mem,cmd |grep test.py')[1]
On a one gig netbook:
0.4 vim test.py
0.1 /bin/bash -c time python test.py
11.7 /usr/bin/python2.6 test.py
0.1 sh -c { ps -eo %mem,cmd |grep test.py; } 2>&1
0.0 grep test.py
real 0m26.325s
user 0m25.945s
sys 0m0.377s
... So it's using about 10% of 1 gig for the million records inserted above.
But it will also depend on how much redundancy there is in your data...
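For instance, to model that redundancy on the value side (a hypothetical tweak to the script above), you could swap gen_string out for something that picks from the handful of actions that actually occur:
import random

ACTIONS = ['joined', 'log out', 'added pic', 'added comment']

def gen_value():
    # random.choice returns a reference to one of these four existing strings,
    # so every dict value is shared rather than being a fresh random string
    return random.choice(ACTIONS)

# d[gen_string()] = gen_value()   # keys stay random, values are now shared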
Thanks to @Ignacio for intern() -
def procLog(logName, userDict):
    inf = open(logName, 'r')
    for ln in inf.readlines():
        name, act = ln.split(':')
        # intern() the action so repeated actions all share one string object;
        # act keeps its trailing newline, so it can be written back out verbatim
        userDict[name] = intern(act)
    inf.close()
    return userDict

def doLogs(logNameList):
    userDict = {}
    for logName in logNameList:
        userDict = procLog(logName, userDict)
    return userDict

def writeOrderedLog(logName, userDict):
    keylist = userDict.keys()
    keylist.sort()
    outf = open(logName, 'w')
    for k in keylist:
        outf.write(k + ':' + userDict[k])
    outf.close()

def main():
    mylogs = ['log20101214', 'log20101215', 'log20101216']
    d = doLogs(mylogs)
    writeOrderedLog('cumulativeLog', d)
The question, then, is how much memory this will consume.
import random
import sys

def makeUserName():
    ch = random.choice
    syl = ['ba','ma','ta','pre','re','cu','pro','do','tru','ho','cre','su','si','du','so','tri','be','hy','cy','ny','quo','po']
    # 22**5 is about 5.1 million potential names
    return ch(syl).title() + ch(syl) + ch(syl) + ch(syl) + ch(syl)

ch = random.choice
states = ['joined', 'added pic', 'added article', 'added comment', 'voted', 'logged out']

d = {}
t = []
for i in xrange(1000):
    for j in xrange(8000):
        d[makeUserName()] = ch(states)
    # record (number of users so far, size of the dict object itself)
    t.append( (len(d), sys.getsizeof(d)) )
Plotting t (horizontal axis = number of user names, vertical axis = memory usage in bytes) gives a result that is... slightly weird. It looks like a dictionary preallocates quite a lot of memory, then doubles it every time it gets too full?
Anyway, 4 million users takes just under 100 MB of RAM, but the last reallocation happens at around 3 million users (when it was about 50 MB), so if the doubling holds you will need roughly 800 MB of RAM to process 24 to 48 million users.
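If you want to watch that resizing behaviour directly (a quick CPython 2.x check; the exact thresholds and sizes vary between versions), something like this prints a line each time sys.getsizeof(d) jumps:
import sys

d = {}
last = sys.getsizeof(d)
for i in xrange(3000000):
    d[i] = None
    size = sys.getsizeof(d)
    if size != last:
        print len(d), size
        last = size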