How to see if a user was not active in all weeks/or active once every 2/4 weeks in a date range?_问答_开发者

Suppose you have a table with (id,userid,timestamp)

From the data set I am after results like this:

300 users were active between Oct 1 and Nov 30 (query is done)
10 users were active less than 1 day a week (in other words, they were not active every week)
Of the 10, 2 we开发者_JAVA技巧re active once every 2 weeks (strict 2 weeks)
Of the 10, 8 were active once every 4 weeks (strict 4 weeks)

So the questions are:

How do you get if a user was not active in all weeks?
How do you get if a user was active once every 2 weeks?

You could join the list to itself and try to find entries that match your conditions. Something like that (without having checked it):

SELECT l1.userid FROM Logs as l1
INNER JOIN Logs as l2
  ON  l2.timestamp > l1.timestamp
  AND l2.timestamp < l1.timestamp + "1 week"
  AND l1.userid = l2.userid

edit:

Counting matched could help:

SELECT COUNT(l1.userid) as matches, l1.userid FROM Logs as l1
INNER JOIN Logs as l2
  ON l1.userid = l2.userid
  AND l2.timestampFake > l1.timestampFake
  AND l2.timestampFake < l1.timestampFake + @interval
WHERE l1.timestampFake > @start AND l1.timestampFake < @end
GROUP BY l1.userid

Assuming activity (user_id, ts) and the period of interest between @ts_start and @ts_end then

You could try
1) not active every week

SELECT user_id 
FROM activity
WHERE CEILING(DATEDIFF(@ts_end,@ts_start)/7) < 
      (SELECT COUNT(*) 
       FROM (SELECT 1
             FROM activity sub
             WHERE ts BETWEEN @ts_start AND @ts_end
                   AND sub.user_id = activity.user_id
             GROUP BY YEAR(ts), WEEK(ts)) x
      )

2) active every two weeks

SELECT user_id 
FROM activity
WHERE CEILING(DATEDIFF(@ts_end,@ts_start)/14) < 
      (SELECT COUNT(*) 
       FROM (SELECT 1
             FROM activity
             WHERE ts BETWEEN @ts_start AND @ts_end
                   AND sub.user_id = activity.user_id
             GROUP BY YEAR(ts), WEEK(ts) DIV 2) x)

This is just a first idea and is not tested (also, the queries check if more than once a week and more than twice a week, replacing < with = should change them to exactly once a week and exactly twice a week, respectively)

EDIT: There were errors in queries above, edited.

Another idea is to transform the requirement - active every week (or more often!) means there is no week with no activity (which translates to find the maximum of difference of consecutive activity timestamps and see if it less or equal to 7 days; finding consecutive activity timestamp can be done by joining to self on all timestamps greater and finding the MIN of those)

SELECT user_id
FROM activity
WHERE 7 >=
    (SELECT MAX(DATEDIFF(ts2,ts1))
     FROM (SELECT a1.ts AS ts1, MIN(a2.ts) AS ts2
           FROM activity a1
                INNER JOIN activity a2 ON
                     a1.user_id = a2.user_id AND a1.ts < a2.ts
           WHERE activity.user_id = a1.user_id AND
                 a1.ts BETWEEN @ts_start AND @ts_end AND
                 a2.ts BETWEEN @ts_start AND @ts_end AND) x )

Replace 7 with 14 for two weeks and play with conditions ( 7 < ... implies there was a gap longer then a week, so not active in all weeks) to turn a query from not active every week to active every week (two weeks).

EDIT2 It should be easy to change the above query to return maximum period of inactivity for each user

 SELECT user_id, MAX(DATEDIFF(ts2,ts1)) 
 FROM (SELECT a1.ts AS ts1, MIN(a2.ts) AS ts2, a1.user_id AS user_id
       FROM activity a1
            INNER JOIN activity a2 ON
                 a1.user_id = a2.user_id AND a1.ts < a2.ts
       WHERE a1.ts BETWEEN @ts_start AND @ts_end AND
             a2.ts BETWEEN @ts_start AND @ts_end AND
       GROUP BY a1.user_id) x

which can then be subtotalled or grouped for reporting purposes.

EDIT3 The above queries seem to bother mysql as it seems (?) that correlation has problems in the WHERE section (which it should not, tested with postgres and similar queries run with no objections)

We can turn correlated conditions in WHERE part into JOIN easily, but during this I realised that a few simplifications can be made

SELECT user_id, COUNT(DISTINCT WEEK(ts)) 
FROM activity
WHERE ts BETWEEN @ts_start AND @ts_end
GROUP BY user_id, YEAR(ts), WEEK(ts))
HAVING COUNT(DISTINCT WEEK(ts)) > CEILING(DATEDIFF(@ts_end,@ts_start)/7)

The above query has problems with ranges over 1 year (you have to modify the count to do something like COUNT(DISTINCT YEAR(ts)*100+WEEK(ts)) in the having section, but I kept it like this since it might take advantage of index to count distinct values if the expression is simple). Also, it should be checked on ranges that span the end of the year - the week function might have a shorter/longer week in new years week, see details in docs.

Now I went to rewrite it again and it seems that the following should work cleaner and if there is an index on (user_id,ts) I believe it will be quite fast

SELECT user_id, COUNT(DISTINCT DATEDIFF(ts,@ts_start) DIV 7) 
FROM activity
WHERE ts BETWEEN @ts_start AND @ts_end
GROUP BY user_id
HAVING COUNT(DISTINCT DATEDIFF(@ts_end,@ts_start) DIV 7) = 
       (DATEDIFF(@ts_end,@ts_start) DIV 7)