Merging 2 very large text files, update each line, without using memory

Say I've got 2 text files with around 2 million lines each (~50-80MB file size each). The structure of both files is the same:

Column1 Column2 Column3
...

Column 1 never changes. For Column 2, the same value may not appear in both files, and the values won't be in the same order in the two files. Column 3 is a number and will be different in every file.

I need to be able to merge them both into one file, matched by Column 2. If Column2 exists in both files, update Column3 by adding the values of Column3 from both files together.

If the files weren't so huge, I could easily do this in PHP by reading each line of both files into arrays and going from there, but doing so easily overloads the available memory.

Is there a way to do this without loading every line into memory? I'm mostly familiar with PHP, but open to Python, Java or shell scripts if they are not too complicated to understand.


I'd go with the command-line sort(1) to merge and sort the files. After that, it's a simple script to compute the sums. I don't know PHP, so I'll give my example in Python:

sort -k2,2 file1 file2 | python3 -c "
import itertools, sys
# sort has already placed equal column-2 values on adjacent lines,
# so each group can be summed as it streams past.
rows = (line.split() for line in sys.stdin)
for key, group in itertools.groupby(rows, lambda r: r[1]):
    group = list(group)   # at most one row per input file
    print(group[0][0], key, sum(int(r[2]) for r in group))
"


Ok, so if I'm reading this right, you'll have:

file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  <-- common column 2
abc 12 67  <-- common column 2
abc 23 1   <-- unique column 2

output should be:

abc 12 101
abc 90 99

If that's the case, then something like this (assuming they're .csv-formatted):

$f1 = fopen('file1.txt', 'rb');
$f2 = fopen('file2.txt', 'rb');
$fout = fopen('output.txt', 'wb');

$data = array();
while (1) {
    if (feof($f1) || feof($f2)) {
        break; // quit if we hit the end of either file
    }

    $line1 = fgetcsv($f1);
    if (isset($data[$line1[1]])) {
       // saw the col2 value earlier, so do the math for the output file:
       $col3 = $line1[2] + $data[$line1[1]][2];
       $output = array($line1[0], $line1[1], $col3);
       fputcsv($fout, $output);
       unset($data[$line1[1]]);
    } else {
       $data[$line1[1]] = $line1; // cache the line, if the col2 value wasn't seen already
    }

    $line2 = fgetcsv($f2);
    if (isset($data[$line2[1]])) {
       $col3 = $data[$line2[1]][2] + $line2[2];
       $newdata = array($line2[0], $line2[1], $col3);
       fputcsv($fout, $newdata);
       unset($data[$line2[1]]); // remove line from cache
    } else {
       $data[$line2[1]] = $line2;
    }
}

// NB: anything still in $data at this point had a col2 value that appeared in
// only one file (or in the unread tail of the longer file); flush it too if needed.

fclose($f1);
fclose($f2);
fclose($fout);

This is going off the top of my head, not tested, probably won't work, YMMV, etc...

It'd simplify things immensely if you pre-sort the two input files, so that column2 is used as the sort key. That'd keep the cache size down, as you'd know if you'd seen a matched value already and when to dump the earlier cached data.
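For what it's worth, here's a rough sketch of that sorted-merge in PHP (untested; it assumes both files are pre-sorted on column 2 with byte ordering, e.g. LC_ALL=C sort -k2,2, so that strcmp() agrees with the file order, that columns are space-delimited, and the file names are placeholders):

// Walk both pre-sorted files in step, summing column 3 when column 2
// matches and passing unmatched rows straight through.
function read_row($fh) {
    $raw = fgets($fh);
    return ($raw === false) ? false : preg_split('/\s+/', trim($raw));
}

function write_row($fh, $row) {
    fwrite($fh, implode(' ', $row) . "\n");
}

$f1 = fopen('file1.sorted.txt', 'rb');
$f2 = fopen('file2.sorted.txt', 'rb');
$out = fopen('output.txt', 'wb');

$r1 = read_row($f1);
$r2 = read_row($f2);

while ($r1 !== false && $r2 !== false) {
    $cmp = strcmp($r1[1], $r2[1]);
    if ($cmp === 0) {
        // Column 2 matches: emit one row with the summed column 3, advance both.
        write_row($out, array($r1[0], $r1[1], $r1[2] + $r2[2]));
        $r1 = read_row($f1);
        $r2 = read_row($f2);
    } elseif ($cmp < 0) {
        write_row($out, $r1);   // key only in file 1
        $r1 = read_row($f1);
    } else {
        write_row($out, $r2);   // key only in file 2
        $r2 = read_row($f2);
    }
}

// Drain whichever file still has rows left.
for (; $r1 !== false; $r1 = read_row($f1)) write_row($out, $r1);
for (; $r2 !== false; $r2 = read_row($f2)) write_row($out, $r2);

fclose($f1);
fclose($f2);
fclose($out);

Since each row is read exactly once and nothing is cached, memory use stays constant regardless of file size.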


What may be throwing you is that you're looking at two files. There's no need for that. To use Mark's excellent example: file1:

abc 12 34
abc 56 78
abc 90 12

file2:

abc 90 87  
abc 12 67  
abc 23 1  

then

sort file1 file2 > file3

yields file3:

abc 12 34
abc 12 67  
abc 23 1
abc 56 78
abc 90 12
abc 90 87  

Second week of CS-101 to reduce that down to its final form.
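In case it helps, that reduction might look something like this as a PHP sketch (untested; it assumes space-delimited columns and the sorted file3 from above):

// Collapse adjacent lines of the sorted file that share column 2, summing
// column 3. Relies on sort(1) having placed equal keys on adjacent lines.
$in  = fopen('file3', 'rb');
$out = fopen('merged.txt', 'wb');

$prev = null; // the row currently being accumulated
while (($raw = fgets($in)) !== false) {
    $row = preg_split('/\s+/', trim($raw));
    if ($prev !== null && $prev[1] === $row[1]) {
        $prev[2] += $row[2];                          // same key: accumulate column 3
    } else {
        if ($prev !== null) {
            fwrite($out, implode(' ', $prev) . "\n"); // new key: flush the last group
        }
        $prev = $row;
    }
}
if ($prev !== null) {
    fwrite($out, implode(' ', $prev) . "\n");         // flush the final group
}

fclose($in);
fclose($out);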


You can solve it easily with Python's included sqlite3 module without using much memory (around 13 MB with 1 million rows):

import sqlite3

files = ("f1.txt", "f2.txt")    # Files to compare

# # Create test data
# for file_ in files:
#   f = open(file_, "w")
#   fld2 = 0
#   for fld1 in "abc def ghi jkl".split():
#       for fld3 in range(1000000 // 4):
#           fld2 += 1
#           f.write("%s %s %s\n" % (fld1, fld2, 1))
#
#   f.close()

sqlite_file = "./join.tmp"      # or :memory: if you don't want to create a file

cnx = sqlite3.connect(sqlite_file)

for file_ in range(len(files)):     # Create & load tables
    table = "file%d" % (file_ + 1)
    cnx.execute("drop table if exists %s" % table)
    cnx.execute("create table %s (fld1 text, fld2 int primary key, fld3 int)" % table)

    for line in open(files[file_], "r"):
        cnx.execute("insert into %s values (?,?,?)" % table, line.split())

cnx.commit()    # make the inserts permanent before querying

# Join & result
cur = cnx.execute("select f1.fld1, f1.fld2, (f1.fld3 + f2.fld3) from file1 f1 join file2 f2 on f1.fld2 == f2.fld2")
while True:
    row = cur.fetchone()
    if not row:
        break

    print(row[0], row[1], row[2])

cnx.close()
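One caveat: the inner join only returns rows whose fld2 appears in both files. If you also want the rows unique to either file in the output, you'd need to add them separately; one simple variant is to load both files into a single table and select fld1, fld2, sum(fld3) grouped by fld2.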


PHP's memory_limit is appropriate for its primary intended task of web-server scripting. It is wildly inappropriate for batch-processing data, like the job you're trying to do. The problem is PHP's configured memory_limit, not that you're trying to do something that needs "too much" memory. My phone has easily enough memory to load two 80 MB files into memory and do this the fast/easy way, let alone any sort of real computer, which should be able to hold gigabytes (or at least 1 GB) of data without breaking a sweat.

Apparently you can set PHP's memory_limit (which is arbitrary, and very small by today's standards) at runtime with ini_set(), just for this script. Do you know how much memory you actually have available on the server? I know a lot of shared hosting providers give you very small amounts of memory by today's standards, because they don't expect you to be doing much more than processing web page requests. But you probably can just do this directly in PHP the way you want to, without jumping through hoops (and massively slowing down the process) to avoid loading both files into memory at once.
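If you go that route, a rough sketch (untested; the '512M' value and the file names are placeholders, and it assumes your host actually lets you raise the limit):

ini_set('memory_limit', '512M');   // raise the limit for this script only

// Build a column2 => row lookup from the first file.
$cache = array();
foreach (file('file1.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $row = preg_split('/\s+/', trim($line));
    $cache[$row[1]] = $row;
}

// Walk the second file: sum column 3 on matches, pass everything else through.
$out = fopen('output.txt', 'wb');
foreach (file('file2.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
    $row = preg_split('/\s+/', trim($line));
    if (isset($cache[$row[1]])) {
        $row[2] += $cache[$row[1]][2];
        unset($cache[$row[1]]);    // matched; don't emit it again below
    }
    fwrite($out, implode(' ', $row) . "\n");
}
// Rows whose column 2 appeared only in file1.
foreach ($cache as $row) {
    fwrite($out, implode(' ', $row) . "\n");
}
fclose($out);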
