doing a pivot-table-ish JOIN in SQL_问答_开发者

My employer has a batch compute cluster that processes jobs submitted by开发者_如何学Go users. Each batch job consists of three steps:

job started
job finished
results reported to the user

The batch job management software logs when each of these steps happen, and the logfile consists of a tuple with an ID code of the employee who submitted the job, what step occurred, and a timestamp of when it occurred. In CSV, it looks like:

ID  step    timestamp
--  ------  ---------
A   start   3533
B   start   3538
B   finish  3549
C   start   3551
A   finish  3557
B   report  3559
C   finish  3602
A   report  3603
B   start   3611
C   report  3623
B   finish  3643
B   report  3657

And so forth.

One additional characteristic of the dataset is that there is concurrence between employees, but no concurrence within employees; ie, each employee has to wait until their current job has reported before their next job starts. So when I sort by date and limit the results to a single employee, the records always come out in the order "start","finish","report".

I want to create a pivot table that groups each job into a single row. So the above data becomes:

employee-ID  started  finished  reported
-----------  -------  --------  --------
A            3533     3557      3603
B            3538     3549      3559
B            3611     3643      3657
C            3551     3602      3623

So, on to the SQL:

SELECT
    log.ID AS employee-ID,
    start.timestamp AS started,
    finish.timestamp AS finished,
    report.timestamp AS reported
FROM
    log

    LEFT OUTER JOIN log start ON
    log.ID = start.ID
        AND start.step = 'start'

    LEFT OUTER JOIN log finish ON
    log.ID = finish.ID
        AND finish.step = 'finish'
        AND start.timestamp < finish.timestamp

    LEFT OUTER JOIN log report ON
    log.ID = report.ID
        AND report.step = 'report'
        AND finish.timestamp < report.timestamp

ORDER BY employee-ID,started,finished,reported;

I do need LEFT OUTER JOIN, because I also need to identify the jobs that were started but were not finished or reported.

This works pretty well. It does give me the rows that I need. But it gives me a lot of spurious rows, because the JOINs match finish and report entries for future jobs of the same employee in addition to the current job. So the report comes out looking like:

employee-ID  started  finished  reported
-----------  -------  --------  --------
A            3533     3557      3603
B            3538     3549      3559
B            3538     3549      3657 <-- spurious
B            3538     3643      3657 <-- spurious
B            3611     3643      3657
C            3551     3602      3623

It's easy to recognize the spurious rows: each job gets started only once, so given the sorting, the correct row is the first row with a unique "started" value. I'm working around the spurious rows problem right now at the application level by just skipping over the spurious rows, but that just seems, well, inelegant. And it's costly: some of these employees have dozens of jobs submitted, so currently, the results of my queries are about 15% legitimate entries and 85% spurious. That's a lot of wasted time skipping over bogus entries. It would be nice if each job had a unique ID, but I just don't have that data.

I need to somehow limit the JOIN so that it picks off only one "finished" and "reported" entry for each "started" entry: the single entry that has the minimum timestamp greater than the timestamp of the preceding step. I tried doing this by using a subquery as the table to which I was JOINing, but I couldn't figure out how to do it without a correlated subquery. I also tried doing it by using "GROUP BY employee-ID,started", but this didn't necessarily pick the "correct" row. Most of the rows the "GROUP BY" reported were the wrong ones.

So, SQL gurus, is it possible to report just the rows I need to? And if so, how? I am using sqlite3 right now, but could transfer the database to MySQL if needed.

The problem is how you're joining to finish and report

You don't want the start.timestamp < finish.timestamp you really want start.timestamp < MIN(finish.timestamp)

Of course that doesn't work so you'll have to do it after the join.

e.g.

SELECT
    log.ID AS employee_ID,
    start.timestamp AS started,
    MIN(finish.timestamp) AS finished,
    MIN(report.timestamp) AS reported
FROM
    log


LEFT OUTER JOIN log start ON
log.ID = start.ID
    AND start.step = 'start'

LEFT OUTER JOIN log finish ON
log.ID = finish.ID
    AND finish.step = 'finish'
    AND start.timestamp < finish.timestamp

LEFT OUTER JOIN log report ON
log.ID = report.ID
    AND report.step = 'report'
    AND finish.timestamp < report.timestamp

GROUP BY log.ID,
    start.timestamp 
ORDER BY 
    employee_ID,started,finished,reported

Also you could probably convert the start to an inner join as it doesn't make a whole lot of sense to have a finish without a start