开发者

Average of rows where column = A within distinct rows on another column grouped by a third column

开发者 https://www.devze.com 2023-03-10 15:24 出处:网络
Using SQL Server, I\'m trying to query a kind of averaged count from a table I didn\'t design, where basically I want a list, grouped by one column, with the number of distinct values of another colum

Using SQL Server, I'm trying to query a kind of averaged count from a table I didn't design, where basically I want a list, grouped by one column, with the number of distinct values of another column matching a given criterion, and of those, the number of rows matching another criterion (which I'll use to created the averaged count or whatever it is). This can't be hard, but I'm having a bad set theory day and any pointers will be gratefully received.

Here's the simplified and genericized scenario (schema and sample data below). Say we have three columns:

  • objid (has a clustered index)
  • userid (no index, I might be able to add one)
  • actiontype (no index, I might be able to add one)

None of these is unique, and none can be null. We want to completely ignore any rows where actiontype is none. We want to know, per userid, how many actiontype = 'flag' rows there are on average per object that user has interacted with.

So if we have "ahmed", "joe", and "maria", and joe interacted with 3 objects and raised 5 flags, the number there is 5 / 3 = 1.6666 continuous; if "ahmed" interacted with 3 objects and didn't raise any flags, his number would be 0; if "maria" interacted with 5 objects and raised 4 flags, her number would be 4 / 5 = 0.8:

+--------+------------------+
| userid | flags_per_object |
+--------+------------------+
| ahmed  | 0                |
| joe    | 1.66666667       |
| maria  | 0.8              |
+--------+------------------+

I won't be remotely surprised if this is closed as a duplicate, I'm just not finding it.

Here's the simplified table setup and sample data:

create table tmp
(
    objid      varchar(254) not null,
    userid     varchar(254) not null,
    actiontype varchar(254) not null
)
create clustered index tmp_objid on tmp(objid)

insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('开发者_如何学Calpha', 'joe', 'update')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'close')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')
insert into tmp (objid, userid, actiontype) values ('alpha', 'joe', 'flag')

insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'none')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'close')
insert into tmp (objid, userid, actiontype) values ('beta', 'joe', 'flag')

insert into tmp (objid, userid, actiontype) values ('gamma', 'joe', 'none')

insert into tmp (objid, userid, actiontype) values ('delta', 'joe', 'update')

insert into tmp (objid, userid, actiontype) values ('alpha', 'maria', 'update')

insert into tmp (objid, userid, actiontype) values ('beta', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('beta', 'maria', 'flag')

insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'flag')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'update')
insert into tmp (objid, userid, actiontype) values ('gamma', 'maria', 'close')

insert into tmp (objid, userid, actiontype) values ('delta', 'maria', 'update')
insert into tmp (objid, userid, actiontype) values ('epsilon', 'maria', 'update')

insert into tmp (objid, userid, actiontype) values ('alpha', 'ahmed', 'none')

insert into tmp (objid, userid, actiontype) values ('beta', 'ahmed', 'none')

insert into tmp (objid, userid, actiontype) values ('gamma', 'ahmed', 'none')
insert into tmp (objid, userid, actiontype) values ('gamma', 'ahmed', 'update')

insert into tmp (objid, userid, actiontype) values ('delta', 'ahmed', 'update')
insert into tmp (objid, userid, actiontype) values ('delta', 'ahmed', 'close')

insert into tmp (objid, userid, actiontype) values ('epsilon', 'ahmed', 'update')
insert into tmp (objid, userid, actiontype) values ('epsilon', 'ahmed', 'close')


You can try the following :

select  t1.userid,
CASE cnt2 
WHEN 0 THEN 0
ELSE ISNULL(cast(cnt2 as float)/cnt1,0)
END as num
FROM
(
  select userid, COUNT(distinct(t1.objid)) as cnt2
  from tmp as t1
  where t1.actiontype <> 'none'
  group by t1.userid
) t1

LEFT JOIN (
SELECT t2.userid, COUNT(*) as cnt1
FROM tmp as t2
WHERE t2.actiontype='flag'
GROUP BY t2.userid)b ON (b.userid = t1.userid)

Even though it looks uglier than your solution, it surprisingly generates a better execution plan based on test data you provided.


(Answering my own question.)

I do have something that works:

select   userid,
         cast(count(case when actiontype = 'flag' then 1 else null end) as float)
         /
         count(distinct(objid))
         as flags_per_object
from     tmp
where    actiontype <> 'none'
group by userid

....but I can't help feeling there's a better way...


;with cte as
(
  select userid,
         sum(case actiontype when 'flag'
               then 1
               else 0
             end) as Flags
  from tmp
  where actiontype <> 'none'
  group by userid, [objid]
)
select userid,
       cast(sum(Flags) as float)/count(*) as flags_per_object
from cte
group by userid

As a sub query instead of CTE

select userid,
       cast(sum(Flags) as float)/count(*) as flags_per_object
from (select userid,
             sum(case actiontype when 'flag'
                   then 1
                   else 0
                 end) as Flags
      from tmp
      where actiontype <> 'none'
      group by userid, [objid]) as T
group by userid


The answer is: It depends.

In my testing, my solution is the slowest of the bunch, regardless of what test data I use. With real life data, it's about half the speed of the fastest solution.

Mikael's solution is faster for the test data quoted in my question, and faster for a larger-but-still-small data set (our testing system, about 2k rows) in my real-life tables.

But a1ex07's solution is faster for my full-size real-life tables (our live system, about 700k rows). There's not a lot of distance between a1ex07's and Mikael's, but a1ex07's definitely has the edge.

I ended up actually using Mikael's solution, though, because it's easier to conceptualize if you're not a l33t DB person (and the people doing maintenance on this code, of which the SQL is only a small part, won't be) and easier to adapt to various other scenarios.

Thus this community wiki meta-answer, which I'll accept when the time limit passes, rather than accepting either of their excellent answers. If you found this helpful, please do vote up both Mikael's answer and a1ex07's answer, as I have done.


SELECT
  userid,
  flags_per_object =
    COUNT(CASE actiontype WHEN 'flag' THEN objid END) *
    1.0 / COUNT(DISTINCT objid)
FROM
  tmp
WHERE actiontype <> 'none'
GROUP BY userid
0

精彩评论

暂无评论...
验证码 换一张
取 消