开发者

data cleansing using grouping based on time interval - sql2005

开发者 https://www.devze.com 2023-02-03 15:57 出处:网络
I have the following data in a 开发者_运维知识库table that I would like to report on without having to delete any rows.

I have the following data in a 开发者_运维知识库table that I would like to report on without having to delete any rows.

ActiveSearchID---SearchDate---------------------SearchPhrase
1---------------------2010-12-15 12:01:11.587---argos
2---------------------2010-12-15 12:03:40.193---muji
3---------------------2010-12-15 12:03:42.370---muji
4---------------------2010-12-15 12:04:29.167---Office supplies
5---------------------2010-12-15 12:05:11.590---lava
9---------------------2010-12-15 12:08:38.920---sony vaio
10-------------------2010-12-15 12:08:41.170---sony vaio
12-------------------2010-12-15 12:09:09.920---sony vaio battery
13-------------------2010-12-15 12:09:17.487---sony vaio battery
14-------------------2010-12-15 12:17:10.980---sony vaio battery
15-------------------2010-12-15 12:17:12.170---argos

The report I am trying to get is to select the first instance of a searchphrase that has been searched on within a 5 minute interval. So for example the query no the information above would result in the following:

SearchDate----------------SearchPhrase

2010-12-15 12:01:11.587---argos

2010-12-15 12:03:40.193---muji

2010-12-15 12:04:29.167---Office supplies

2010-12-15 12:05:11.590---lava

2010-12-15 12:08:38.920---sony vaio

2010-12-15 12:09:09.920---sony vaio battery
2010-12-15 12:17:12.170---argos

i've tried the following query but i am still getting duplicates:

select t1.searchdate, t1.searchphrase from activesearches t1 inner join activesearches t2 on t1.searchphrase = t2.searchphrase and t1.searchdate < t2.searchdate where datediff(s, t1.searchdate, t2.searchdate) <= 300 order by searchdate

I would like to use the "WITH SearchPhrases AS ()" type of query but I just can't get my head around it.

Thanks


I believe given your test data "sony vaio battery" should have been returned twice. I came up with two options.

-- Populate test data
if(OBJECT_ID('tempdb..#Search') IS NOT NULL)
    DROP TABLE #Search
create table #Search (
    ActiveSearchID int primary key, 
    SearchDate datetime not null, 
    SearchPhrase nvarchar(30))

insert into #Search(ActiveSearchID, SearchDate, SearchPhrase)
select 1, '2010-12-15 12:01:11.587', 'argos'
union all select 2, '2010-12-15 12:03:40.193', 'muji'
union all select 3, '2010-12-15 12:03:42.370', 'muji'
union all select 4, '2010-12-15 12:04:29.167', 'Office supplies'
union all select 5, '2010-12-15 12:05:11.590', 'lava'
union all select 9, '2010-12-15 12:08:38.920', 'sony vaio'
union all select 10, '2010-12-15 12:08:41.170', 'sony vaio'
union all select 12, '2010-12-15 12:09:09.920', 'sony vaio battery'
union all select 13, '2010-12-15 12:09:17.487', 'sony vaio battery'
union all select 14, '2010-12-15 12:17:10.980', 'sony vaio battery'
union all select 15, '2010-12-15 12:17:12.170', 'argos'

I think you are looking for something like this query. I am not sure how this will perform yet though:

select * 
from #Search as S
where not exists(
select * from #Search as N
where N.SearchPhrase= S.SearchPhrase
and N.SearchDate between 
    dateadd(minute, -5, S.SearchDate) AND S.SearchDate
and N.ActiveSearchID <> S.ActiveSearchID)

Alternatively, if you can use discreet 5 minute intervals on the clock, this may perform better - I have not tested either with a large amount of data:

select
    ActiveSearchID, SearchDate, SearchPhrase
from
(
    select 
        *,
        ROW_NUMBER() over (
                partition by SearchPhrase,  
                             DATEDIFF(minute, '2000-01-01', SearchDate) / 5
            order by SearchDate, ActiveSearchID) as rn,
        DATEDIFF(minute, '2000-01-01', SearchDate) as five_minute_window 
    from #Search
) as X
where
    rn = 1
order by
    ActiveSearchID
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号