开发者

Select query for recurring dates with tally table

开发者 https://www.devze.com 2023-02-02 16:57 出处:网络
I have a table for events, with a field that specifies how frequently the event occurs (in days).I\'d like to select all occurrences of the event within a given date range, including the calculated oc

I have a table for events, with a field that specifies how frequently the event occurs (in days). I'd like to select all occurrences of the event within a given date range, including the calculated occurrences (e.g. if the first event date is Jan 6 2011 and it occurs every 7 days, you'd see Jan 13 and Jan 20 in the results).

Here's what my events table looks like:


event_ID INT,
event_title NVARCHAR(50),
first_event_date DATETIME,
occurs_every INT

After reading this article, it seems like the most efficient way to handle this is with a tally table, but I haven't been able to wrap my head around how to return the results I'm looking for.

Let's say I have data that looks like this:

event_ID | event_title | first_event_date | occurs_every
1        | Event 1     |  1/6/2011        |     7
2        | Event 2     |  1/8/2011        |     3

The results I'm looking for would be:

event_ID | event_title | event_date | 
1        | Event 1     |  1/6/2011  | 
2   开发者_JS百科     | Event 2     |  1/8/2011  | 
1        | Event 1     |  1/13/2011 | 
2        | Event 2     |  1/12/2011 | 
2        | Event 2     |  1/16/2011 | 
1        | Event 1     |  1/20/2011 | 
(etc)

Any suggestions? Edit: I'm using SQL Server 2008.

Additional info:

I've got a working query, but it seems pretty kludgy and I'm concerned about the performance once I get more data into the table.

First, for reference, this is the Tally table:


SELECT TOP 11000 
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM Master.dbo.SysColumns sc1,
        Master.dbo.SysColumns sc2

  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N 
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

Now, here's the kludgy select query:

SELECT  event_ID,
        event_title,
        first_event_date,
        DATEADD(dd, occurs_every * ( t.N - 1 ), [first_event_date]) AS occurrence
FROM    dbo.Events
        CROSS JOIN dbo.Tally t
WHERE   DATEADD(dd, occurs_every * ( t.N - 1 ), [first_event_date]) <= '03-01-2011
ORDER BY occurrence`

Now, this works - but when I added 1000 rows of sample data to the table it really bogged down. I assume that's my cross join.


In SQL Server 2008 you can use a recursive CTE.

DECLARE @StartDate DATE, @EndDate DATE
SET @StartDate = '20110106'
SET @EndDate = '20110228';


WITH DateTable AS
(
    SELECT Event_id, event_title, event_date, occurs_every
    FROM tally_table
    UNION ALL
    SELECT event_ID, event_title, DATEADD(DAY,occurs_every,event_date), occurs_every
    FROM DateTable
    WHERE DATEADD(DAY,occurs_every,event_date) BETWEEN @StartDate AND @EndDate
)
SELECT Event_id, event_title, event_date
FROM DateTable
WHERE event_date BETWEEN @StartDate AND @EndDate
ORDER BY event_date

You have to remember to filter by the date range, so it doesn't get in a infinite loop. Or use the MAXRECURSION hint to limit the results (by default this value is 100)


First, please accept my most sincere apologies for not getting back to this post. I made a couple of comments as a preamble and with full intent to post a useful answer later instead of just “sage advice” and then real life happened and I totally lost track of this post.

Let’s first revisit the OP’s post by building the table he said he was using and populating it with a thousand events like he said he did. I’ll modernize the data a bit by using random start dates for 2015 and 2016 using a high performance “pseudo cursor” to provide the “presence of rows” that we need instead of the RBAR of either a While Loop or rCTE (Recursive CTE).

As a bit of a side bar, I’m keeping everything 2005 compatible because there are still a whole lot of people using 2005 and there’s no performance gain in using 2008+ techniques for this.

Here’s the code to build the test table. Details are in the comments.

--====================================================================
--      Presets
--====================================================================
--===== Declare and prepopulate some obviously named variables
DECLARE  @StartDate     DATETIME
        ,@EndDate       DATETIME
        ,@Days          INT
        ,@Events        INT
        ,@MaxEventGap   INT
;
 SELECT  @StartDate     = '2015-01-01' --Inclusive date
        ,@EndDate       = '2017-01-01' --Exclusive date 
        ,@Days          = DATEDIFF(dd,@StartDate,@EndDate)
        ,@Events        = 1000
        ,@MaxEventGap   = 30 --Note that 1 day will be the next day
;
--====================================================================
--      Create the Test Table
--====================================================================
--===== If the test table already exists, drop it to make reruns of 
     -- this demo easier. I also use a Temp Table so that we don't 
     -- accidenttly screw up a real table.
     IF OBJECT_ID('tempdb..#Events','U') IS NOT NULL
        DROP TABLE #Events
;
--===== Build the test table.
     -- I'm following what the OP did so that anyone with a case 
     -- sensitive server won't have a problem.
 CREATE TABLE #Events
        (
        event_ID            INT,
        event_title         NVARCHAR(50),
        first_event_date    DATETIME,
        occurs_every        INT
        )
;
--====================================================================
--      Populate the Test Table
--====================================================================
--===== Build @Events number of events using the previously defined
     -- start date and number of days as limits for the random dates.
     -- To make life a little easier, I'm using a CTE with a 
     -- "pseudo-cursor" to form most of the data and then an
     -- external INSERT so that I can name the event after the
     -- event_ID.
   WITH cteGenData AS
        (
         SELECT TOP (@Events)
                event_ID         = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
               ,first_event_date = DATEADD(dd, ABS(CHECKSUM(NEWID())) % @Days, @StartDate)
               ,occurs_every     = ABS(CHECKSUM(NEWID())) % 30 + 1
           FROM      sys.all_columns ac1 --Has at least 4000 rows in it for most editions
          CROSS JOIN sys.all_columns ac2 --Just in case it doesn't for Express ;-)
        )
 INSERT INTO #Events
        (event_ID, event_title, first_event_date, occurs_every)
 SELECT  event_ID
        ,event_title = 'Event #' + CAST(event_id AS VARCHAR(10))
        ,first_event_date
        ,occurs_every
   FROM cteGenData
;
--===== Let's see the first 10 rows
 SELECT TOP 10 *
   FROM #Events
  ORDER BY event_ID
;

Here are what the first 10 rows will look like with the understanding that the values for first_even_datet and occurs_every will be quite different because of the methods I used to generate constrained random data.

event_ID event_title first_event_date        occurs_every
-------- ----------- ----------------------- ------------
1        Event #1    2016-10-12 00:00:00.000 10
2        Event #2    2015-04-25 00:00:00.000 28
3        Event #3    2015-11-08 00:00:00.000 4
4        Event #4    2016-02-16 00:00:00.000 25
5        Event #5    2016-06-11 00:00:00.000 15
6        Event #6    2016-04-29 00:00:00.000 14
7        Event #7    2016-04-16 00:00:00.000 9
8        Event #8    2015-03-29 00:00:00.000 2
9        Event #9    2016-02-14 00:00:00.000 29
10       Event #10   2016-01-23 00:00:00.000 8

Just to be sure, you're going to need a Tally Table to duplicate the OPs experiment. Here's the code for that. If you already have one, make sure that it had the required unique clustered index (usually in the form of a PK) for performance reasons. I have modernized the row-source tables in the "pseudo-cursor" part of the code to NOT use the deprecated "syscolumns" view.

--===== Create a Tally Table with enough sequential numbers
     -- for more than 30 years worth of dates.
SELECT TOP 11000 
        IDENTITY(INT,1,1) AS N
   INTO dbo.Tally
   FROM      sys.all_columns sc1
  CROSS JOIN sys.all_columns sc2
;
--===== Add the quintessential Unique Clustered Index as the PK.
  ALTER TABLE dbo.Tally
    ADD CONSTRAINT PK_Tally_N 
        PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
;

We're ready to rock. A part of the OP's code got swallowed by the forum but I was able to recover it using an edit of his original post. It actually looks like this except that I changed the "end date" to match the data that I just generated (and that's the only change I made). Since the code contains no scalar or multi-statement UDFs, I also turned on statistics to try to explain what's going on.

Here's the OP's code with the changes mentioned.

SET STATISTICS TIME,IO ON
;
 SELECT event_id,
        event_title,
        first_event_date,
        DATEADD(dd, occurs_every * ( t.N - 1 ), [first_event_date]) AS Occurrence
   FROM #Events 
  CROSS JOIN dbo.Tally t
  WHERE t.N <= DATEDIFF(dd,first_event_date,'2017-03-01') / occurs_every + 1
  ORDER BY Occurrence
;
SET STATISTICS TIME,IO OFF
;

Here are the stats from running the OPs code. Sorry about all the scrolling but they're long lines.

(61766 row(s) affected)
Table 'Worktable'. Scan count 4, logical reads 118440, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 4, logical reads 80, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Events_____________________________________________________________________________________________________________00000000001F'. Scan count 5, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 4196 ms,  elapsed time = 1751 ms.

Obviously, that performance is making sucking sounds that even a While Loop or rCTE could beat. What IS the problem?

If you check out the highlighted arrow in the execution plan below, you find that it contains 11 MILLION actual rows because of the non-SARGable (SARG = "Search ARGument" and non-SARGable means it can't use an index properly) criteria that caused a full CROSS JOIN between the 11,000 row Tally Table and the 1,000 row #Events table. And those are ACTUAL rows, not ESTIMATED rows, folks.

Select query for recurring dates with tally table

The reason is because the "N" column of the Tally Table is used in a formula and the entire Tally Table must be scanned as a result for every row in the #Events table. This is a common error that makes people thing that Tally Tables produce slow code.

So, how do we fix it? Rather than using t.N to calculate a date for each row, let's take the difference of dates and divide by the number of days to figure out the number of occurrences needed to equate t.N to and see what happens. Note that the only thing I changed in the code below was the criteria in the WHERE clause to make the lookup on t.N SARGable (able to use an index to start and stop a seek followed by a range scan).

SET STATISTICS TIME,IO ON
;
 SELECT event_id,
        event_title,
        first_event_date,
        DATEADD(dd, occurs_every * ( t.N - 1 ), [first_event_date]) AS Occurrence
   FROM #Events 
  CROSS JOIN dbo.Tally t
  WHERE t.N <= DATEDIFF(dd,first_event_date,'2017-03-01') / occurs_every + 1
  ORDER BY Occurrence
;
SET STATISTICS TIME,IO OFF
;

Here's what the new execution plan looks like. 61,766 rows actual rows (all in cache) is a whole lot different than 11 MILLION rows.

Select query for recurring dates with tally table

Here are what the stats on that little slice of computational heaven look like.

(61766 row(s) affected)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table '#Events_____________________________________________________________________________________________________________00000000001F'. Scan count 5, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Tally'. Scan count 1000, logical reads 3011, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 78 ms,  elapsed time = 528 ms.
  • CPU time decreased by 52.79 times or 5,279%.
  • Elapsed time decreased by 2.32 times or 232%.
  • Total reads decreased by 38.27 times or 3,827%

Total amount of code changed... 1 line of the WHERE clause.

We could lower the total number of reads to just 7 by using one of Itzik Ben-Gan's inline cascading CTEs (is NOT an rCTE), as well.

The bottom line is that while the use of a Tally Table is nearly a panacea of performance, you do have to use it correctly, just like anything else. You have to use "Best Practices", such as writing a SARGable WHERE clause to get it to us the index correctly, just like anything else.

Again, my most sincere apologies, especially to the OP, for being so late with this. I hope it will help someone in the future. I also apologize for not having the time to rewrite the rCTE example on this thread to show how bad it can be. If you're interested in why rCTEs are so bad and you don't mind SQLServerCentral.com membership, then here's an article on the subject. I'd post all that here but it's too long to do so.

Hidden RBAR: Counting with Recursive CTE's


Here is one method using Oracle (you can switch this to other engines by modifying the sub-query that generates consecutive numbers, see below). The idea behind this query is to generate a consecutive list of multipliers (e.g. 0, 1, 2, 3..., n) up to the window size (days between dates). This is what the sub-query returns. We use this to cross join with the event table and then limit the results to the requested date range.

SELECT t.event_id, t.event_title, t.event_date + t.occurs_every*x.r event_date
FROM tally_table t CROSS JOIN (
 SELECT rownum-1 r FROM DUAL 
        connect by level <= (date '2011-1-20' - date '2011-1-6') + 1
 ) x
WHERE t.event_date + t.occurs_every*x.r <= date '2011-1-20'
ORDER BY t.event_date + t.occurs_every*x.r, t.event_id;

The tally_table in the query is the table you specified in your question.

0

精彩评论

暂无评论...
验证码 换一张
取 消