开发者

Recursive SQL query to speed up non-indexed query

开发者 https://www.devze.com 2023-01-20 04:07 出处:网络
This question is largely driven by curiosity, as I do have a working query (it just takes a litt开发者_JS百科le longer than I would like).

This question is largely driven by curiosity, as I do have a working query (it just takes a litt开发者_JS百科le longer than I would like).

I have a table with 4 million rows. The only index on this table is an auto-increment BigInt ID. The query is looking for distinct values in one of the columns, but only going back 1 day. Unfortunately, the ReportDate column that is evaluated is not of the DateTime type, or even a BigInt, but is char(8) in the format of YYYYMMDD. So the query is a bit slow.

  SELECT Category 
    FROM Reports 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Note that the date converstion in the above statement is simply converting it to a YYYYMMDD format for comparison.

I was wondering if there was a way to optimize this query based on the fact that I know that the only data I am interested in is at the "bottom" of the table. I was thinking of some sort of recursive SELECT function which gradually grew a temporary table that could be used for the final query.

For example, in psuedo-sql:

N = 128
TemporaryTable = SELECT TOP {N} * 
                   FROM Reports 
               ORDER BY ID DESC 

/* Once we hit a date < Today, we can stop */
if(TemporaryTable does not contain ReportDate < Today) 
  N = N**2
  Repeat Select

/* We now have a smallish table to do our query */
  SELECT Category 
    FROM TemproaryTable 
   where ReportDate = CONVERT(VARCHAR(8), GETDATE(), 112) 
GROUP BY Category

Does that make sense? Is something like that possible?

This is on MS SQL Server 2008.


I might suggest you do not need to convert the Date that is stored as char data in YYYYMMDD format; That format is inherently sortable all by itself. I would instead convert your date to output in that format.

Also, the way you have the conversion written, it is converting the current DateTime for every individual row, so even storing that value for the whole query could speed things up... but I think just converting the date you are searching for to that format of char would help.

I would also suggest getting the index(es) you need created, of course... but that's not the question you asked :P


Why not just create the index you need?

create index idx_Reports_ReportDate 
    on Reports(ReportDate, Category)


No, that doesn't make sense. The only way to optimize this query is to have a covering index for it:

CREATE INDEX ndxReportDateCategory ON Reports (ReportDate, Category);

Update

Considering your comment that you cannot modify the schema, then you should modify the schema. If you still can't, then the answer still applies: the solution is to have an index.

And finally, to answer more directly your question, if you have a strong correlation between ID and ReportData: the ID you seek is the biggest one that has a ReportDate smaller than the date you're after:

SELECT MAX(Id) 
FROM Reports
WHERE ReportDate < 'YYYYMMDD';

This will do a reverse scan on the ID index and stop at the first ID that is previous to your desired date (ie. will not scan the entire table). You can then filter your reports base don this found max Id.


I think you will find the discussion on SARGability, on Rob Farley's Blog to be very interesting reading in relation to your post topic.

http://blogs.lobsterpot.com.au/2010/01/22/sargable-functions-in-sql-server/

An interesting alternative approach that does not require you to modify the existing column data type would be to leverage computed columns.

alter table REPORTS
add castAsDate as CAST(ReportDate as date)

create index rf_so2 on REPORTS(castAsDate) include (ReportDate)


One of the query patterns I occasionally use to get into a log table with similiar indexing to yours is to limit by subquery:

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

SELECT *
FROM
(
SELECT top 20000 *
FROM Reports
ORDER BY ID desc
) sub
WHERE sub.ReportDate = @ReportDate

20k/4M = 0.5% of the table is read.


Here's a loop solution. Note: might want to make ID primary key and Reportdate indexed in the temp table.

DECLARE @ReportDate varchar(8)
SET @ReportDate = Convert(varchar(8), GetDate(), 112)

DECLARE @CurrentDate varchar(8), MinKey bigint


SELECT top 2000 * INTO #MyTable
FROM Reports ORDER BY ID desc

SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
FROM #MyTable

WHILE @ReportDate <= @CurrentDate
BEGIN

  SELECT top 2000 * INTO #MyTable
  FROM Reports WHERE ID < @MinKey ORDER BY ID desc

  SELECT @CurrentDate = MIN(ReportDate), @MinKey = MIN(ID)
  FROM #MyTable

END

SELECT * FROM #MyTable
WHERE ReportDate = @ReportDate


DROP TABLE #MyTable
0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号