I have a table as follows in Postgres SQL 8.4:
1 | John Smith | 2011-08-12 12:44:13.125+08
2 | John Smith | 2011-08-16 08:38:57.968+08
3 | John Smith | 2011-08-16 08:38:58.062+08
4 | Kenny Long | 2011-08-16 17:06:35.843+08
5 | Kenny Long | 2011-08-16 17:06:35.90开发者_如何学Go6+08
6 | Kenny Long | 2011-08-16 17:06:59.281+08
7 | Kenny Long | 2011-08-16 17:07:00.234+08
8 | Kenny Long | 2011-08-16 17:07:32.859+08
9 | Kenny Long | 2011-08-16 17:08:00.437+08
10 | Kenny Long | 2011-08-16 17:08:22.718+08
11 | Kenny Long | 2011-08-16 17:08:22.781+08
I would like to select the columns based timestamp. Only one row is needed for those records that fall within 2 minutes from each other. For example records number 4 to 9 should return only row number 4 and ignore the rest of the rows.
How can I achieve this? Your help is greatly appreciated.
Thank you in advance.
Joe Liew
I've tried it with a recursive way. I'm not sure it's the better way, and I'm quite sure I should study some Window operations to reduce it.
But It worked on my test case. The goal is to start with one min timestamp per guy, then track which rows are to be deleted (within the 2 min range), and which row is the next valid one. Then at each iteration we continue from this valid row (one per guy).
So here is the query for table myschema.mytable with colums id,name,tm. Note that the level column is there just to track recursion and debug, not necessary:
WITH RECURSIVE mytmp(id,name,thetime,thelevel) AS (
-- recursive query: 1st row
-- starting point, one row of the table for each people
-- with a subquery to get the min time with id, maybe a better way to do it
(
select myschema.mytable.id,myschema.mytable.name,myschema.mytable.tm as thetime,1 as thelevel
from (
select name,min(tm) as mintm
from myschema.mytable
group by name
) q,myschema.mytable
WHERE myschema.mytable.name=q.name
AND myschema.mytable.tm=q.mintm
ORDER BY name ASC) -- end of starting point of recursive query
UNION ALL
-- now the recursive part, starting from the 1st row and then again and again (loop)
-- get descendants in the 2 minutes interval for every computed row already in mytmp
--
-- get from previous iterations targets, one per guy
-- and track the 1st new valid row (>2min) for that guy
-- removing bad rows (<2min) is easy, several way to do it
-- keeping only one valid row (and not all the others is harder, limit and aggregates functions are restricted in recursive terms
-- we must keep only one, as the future valid rows will depend on the 2 minutes range from this one
-- maybe some window function could help me, but at least I've a working solution
select myschema.mytable.id,myschema.mytable.name,myschema.mytable.tm as thetime,q2.thelevel
FROM myschema.mytable,(
-- here need to keep 1st true one
select myschema.mytable.name,MIN(myschema.mytable.tm) as tm,mytmp2.thelevel +1 as thelevel
FROM myschema.mytable,(
select id,name,thetime,thelevel
from mytmp
) mytmp2
-- hack: mytmp2 is useless, mytmp should have been used
-- we create this indirection to avoid this message:
-- "ERROR: aggregate functions not allowed in a recursive query's recursive term"
-- on the MIN functions
-- I do not know why it worked :-)
WHERE myschema.mytable.name=mytmp2.name
-- future
AND myschema.mytable.tm - mytmp2.thetime > INTERVAL '0'
GROUP BY
-- hack the group by, to make 2 groups
-- the first one for rows in the 2 min range and the second one for others
CASE WHEN ((myschema.mytable.tm - mytmp2.thetime) > INTERVAL '2 minutes') THEN 1 ELSE 2 END,
myschema.mytable.name,mytmp2.thelevel,mytmp2.thetime
-- then with the having we keep only the second group, containing the first valid > 2min row
HAVING ((MIN(myschema.mytable.tm) - mytmp2.thetime) > INTERVAL '2 minutes')=true
) q2 -- q2contains 1st true row and all false rows for each people
-- q2 is used to get the id, that we cannot have in a group by request
WHERE q2.tm=myschema.mytable.tm
AND q2.name=myschema.mytable.name
) -- end of recursive query
SELECT *
FROM mytmp
ORDER BY name asc, thelevel asc,thetime asc
-- LIMIT 100 -- to debug, avoid infinite loops
Another solution would maybe be using a stored procedure, doing the same things in a temporary table (take valid rows, delete rows in the 2min range, then take the next valid ones, etc), maybe easier to maintain.
Just some idea, not tested. Windowing function need 8.4 / later.
SELECT * FROM
(SELECT
name,
tm,
case
when lagname is NULL OR -- first row of everything
(name <> lagname) OR -- we have order by name, this is first row of this name
(name = lagname AND lagtm + interval '2 minutes' >= tm)
then 1
else 0
end as flags
FROM
(
SELECT name,
tm,
lag(name) over (order by name,tm) as lagname,
lag(tm) over (order by name,tm) as lagtm
from "table"."table"
) AS lagtable
) AS blar
WHERE "flags" = 1
精彩评论