display records which exist in file2 but not in file1_问答_开发者

display records which exist in file2 but not in file1

开发者 https://www.devze.com 2022-12-22 22:36 出处：网络

log file1开发者_运维问答 contains records of customers(name,id,date) who visited yesterday log file2 contains records of customers(name,id,date) who visited today

How would you display customers who visited yesterday but not today?

Constraint is: Don't use auxiliary data structure because file contains millions of records. [So, no hashes] Is there a way to do this using Unix commands ??

an example, but check the man page of comm for the option you want.

comm -2 <(sort -u yesterday) <(sort -u today)

The other tool you can use is diff

diff <(sort -u yesterday) <(sort -u today)

I was personally going for the creating a data structure and records of visits, but, I can see how you'd do it another way too.

In pseudocode, that looks something like python but could be re-written in perl or shell script or ...

import subprocess 
import os

for line in fileinput.input(['myfile'])::
    # split out data. For the sake of it I'm assuming name\tid\tdate
    fields = line.split("\")
    id = fields[1]

    grepresult = subprocess.Popen("grep \"" + id + "\" file1", shell=True, bufsize=bufsize, stdout=PIPE).stdout

    if len(grepresult) == 0:
        print fields # it wasn't in field1

That's not perfect, not tested so treat appropriately but it gives you the gist of how you'd use unix commands. That said, as sfussenegger points out C/C++ if that's what you're using should be able to handle pretty large files.

Disclaimer: this is a not so neat solution (repeatedly calling grep) to match the requirements of the question. If I was doing it, I would use C.

Is a customer identified by id? Is it an int or long? If the answer to both questions is yes, an array with 10,000,000 integers shouldn't take more than 10M*4 = 40MB memory - not a big deal on decent hardware. Simply sort and compare them.

btw, sorting an array with 10M random ints takes less than 2 seconds on my machine - again, nothing to be afraid of.

Here's some very simple Java code:

public static void main(final String args[]) throws Exception {

    // elements in each log file
    int count = 10000000;

    // "read" our log file
    Random r = new Random();
    int[] a1 = new int[count];
    int[] a2 = new int[count];
    for (int i = 0; i < count; i++) {
        a1[i] = Math.abs(r.nextInt());
        a2[i] = Math.abs(r.nextInt());
    }

    // start timer
    long start = System.currentTimeMillis();

    // sort logs
    Arrays.sort(a1);
    Arrays.sort(a2);

    // counters for each array
    int i1 = 0, i2 = 0, i3 = 0;

    // initial values
    int n1 = a1[0], n2 = a2[0];

    // result array
    int[] a3 = new int[count];

    try {
        while (true) {
            if (n1 == n2) {
                // we found a match, save value if unique and increment counters
                if (i3 == 0 || a3[i3-1] != n1) a3[i3++] = n1;
                n1 = a1[i1++];
                n2 = a2[i2++];
            } else if (n1 < n2) {
                // n1 is lower, increment counter (next value is higher)
                n1 = a1[i1++];
            } else {
                // n2 is lower, increment counter (next value is higher)
                n2 = a2[i2++];
            }
        }
    } catch (ArrayIndexOutOfBoundsException e) {
        // don't try this at home - it's not the pretties way to leave the loop!
    }

    // we found our results
    System.out.println(i3 + " commont clients");
    System.out.println((System.currentTimeMillis() - start) + "ms");

}

result

// sample output on my machine:
46308 commont clients
3643ms

as you see, quite efficient for 10M records in each log

display records which exist in file2 but not in file1

精彩评论

关注公众号

热门标签

图文推荐

display records which exist in file2 but not in file1

更多 问答 相关资讯：

精彩评论

关注公众号

热门标签

图文推荐

更多问答相关资讯：