Suppose you have a table with (id,userid,timestamp)
From the data set I am after results like this:
- 300 users were active between Oct 1 and Nov 30 (query is done)
- 10 users were active less than 1 day a week (in other words, they were not active every week)
- Of the 10, 2 we开发者_JAVA技巧re active once every 2 weeks (strict 2 weeks)
- Of the 10, 8 were active once every 4 weeks (strict 4 weeks)
So the questions are:
- How do you get if a user was not active in all weeks?
- How do you get if a user was active once every 2 weeks?
You could join the list to itself and try to find entries that match your conditions. Something like that (without having checked it):
SELECT l1.userid FROM Logs as l1
INNER JOIN Logs as l2
ON l2.timestamp > l1.timestamp
AND l2.timestamp < l1.timestamp + "1 week"
AND l1.userid = l2.userid
edit:
Counting matched could help:
SELECT COUNT(l1.userid) as matches, l1.userid FROM Logs as l1
INNER JOIN Logs as l2
ON l1.userid = l2.userid
AND l2.timestampFake > l1.timestampFake
AND l2.timestampFake < l1.timestampFake + @interval
WHERE l1.timestampFake > @start AND l1.timestampFake < @end
GROUP BY l1.userid
Assuming activity (user_id, ts)
and the period of interest between @ts_start
and @ts_end
then
You could try
1) not active every week
SELECT user_id
FROM activity
WHERE CEILING(DATEDIFF(@ts_end,@ts_start)/7) <
(SELECT COUNT(*)
FROM (SELECT 1
FROM activity sub
WHERE ts BETWEEN @ts_start AND @ts_end
AND sub.user_id = activity.user_id
GROUP BY YEAR(ts), WEEK(ts)) x
)
2) active every two weeks
SELECT user_id
FROM activity
WHERE CEILING(DATEDIFF(@ts_end,@ts_start)/14) <
(SELECT COUNT(*)
FROM (SELECT 1
FROM activity
WHERE ts BETWEEN @ts_start AND @ts_end
AND sub.user_id = activity.user_id
GROUP BY YEAR(ts), WEEK(ts) DIV 2) x)
This is just a first idea and is not tested (also, the queries check if more than once a week and more than twice a week, replacing <
with =
should change them to exactly once a week and exactly twice a week, respectively)
EDIT: There were errors in queries above, edited.
Another idea is to transform the requirement - active every week (or more often!) means there is no week with no activity (which translates to find the maximum of difference of consecutive activity timestamps and see if it less or equal to 7 days; finding consecutive activity timestamp can be done by joining to self on all timestamps greater and finding the MIN of those)
SELECT user_id
FROM activity
WHERE 7 >=
(SELECT MAX(DATEDIFF(ts2,ts1))
FROM (SELECT a1.ts AS ts1, MIN(a2.ts) AS ts2
FROM activity a1
INNER JOIN activity a2 ON
a1.user_id = a2.user_id AND a1.ts < a2.ts
WHERE activity.user_id = a1.user_id AND
a1.ts BETWEEN @ts_start AND @ts_end AND
a2.ts BETWEEN @ts_start AND @ts_end AND) x )
Replace 7 with 14 for two weeks and play with conditions ( 7 < ... implies there was a gap longer then a week, so not active in all weeks) to turn a query from not active every week to active every week (two weeks).
EDIT2 It should be easy to change the above query to return maximum period of inactivity for each user
SELECT user_id, MAX(DATEDIFF(ts2,ts1))
FROM (SELECT a1.ts AS ts1, MIN(a2.ts) AS ts2, a1.user_id AS user_id
FROM activity a1
INNER JOIN activity a2 ON
a1.user_id = a2.user_id AND a1.ts < a2.ts
WHERE a1.ts BETWEEN @ts_start AND @ts_end AND
a2.ts BETWEEN @ts_start AND @ts_end AND
GROUP BY a1.user_id) x
which can then be subtotalled or grouped for reporting purposes.
EDIT3 The above queries seem to bother mysql as it seems (?) that correlation has problems in the WHERE section (which it should not, tested with postgres and similar queries run with no objections)
We can turn correlated conditions in WHERE part into JOIN easily, but during this I realised that a few simplifications can be made
SELECT user_id, COUNT(DISTINCT WEEK(ts))
FROM activity
WHERE ts BETWEEN @ts_start AND @ts_end
GROUP BY user_id, YEAR(ts), WEEK(ts))
HAVING COUNT(DISTINCT WEEK(ts)) > CEILING(DATEDIFF(@ts_end,@ts_start)/7)
The above query has problems with ranges over 1 year (you have to modify the count to do something like COUNT(DISTINCT YEAR(ts)*100+WEEK(ts))
in the having section, but I kept it like this since it might take advantage of index to count distinct values if the expression is simple). Also, it should be checked on ranges that span the end of the year - the week function might have a shorter/longer week in new years week, see details in docs.
Now I went to rewrite it again and it seems that the following should work cleaner and if there is an index on (user_id,ts) I believe it will be quite fast
SELECT user_id, COUNT(DISTINCT DATEDIFF(ts,@ts_start) DIV 7)
FROM activity
WHERE ts BETWEEN @ts_start AND @ts_end
GROUP BY user_id
HAVING COUNT(DISTINCT DATEDIFF(@ts_end,@ts_start) DIV 7) =
(DATEDIFF(@ts_end,@ts_start) DIV 7)
精彩评论