开发者

Optimizing Numbers Table Creation on SQL Server?

开发者 https://www.devze.com 2023-02-08 02:45 出处:网络
Running SQL Server Express 2008. I created a \"numbers table\" for some utility functions. Since the table population is part of an automated build, this is taking an inordinate amount of time each ti

Running SQL Server Express 2008. I created a "numbers table" for some utility functions. Since the table population is part of an automated build, this is taking an inordinate amount of time each time the thing is deployed.

At the risk of "over-optimizing", can anyone comment on how I can make this go as fast as possible? Maybe playing with the index fill f开发者_如何转开发actor or when the PK is created?

IF EXISTS (SELECT *  FROM dbo.sysobjects 
WHERE id = OBJECT_ID(N'Numbers') AND OBJECTPROPERTY(id, N'IsUserTable') = 1)
BEGIN
drop TABLE [Numbers]
end

CREATE TABLE [Numbers]
(
      [Number] [int]
    , CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED
        (
            [number] ASC
        ) ON [PRIMARY]
) 
ON [PRIMARY]
Declare @cnt int
Select @cnt=0
SET NOCOUNT ON
while (@cnt<10000)
BEGIN
INSERT INTO NUMBERS(NUMBER) SELECT @cnt
SELECT @cnt=@cnt+1

end


SQL, Auxiliary table of numbers by Jeff Moden. This one was the fastest

--===== Itzik's CROSS JOINED CTE method
   WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
        E02(N) AS (SELECT 1 FROM E00 a, E00 b),
        E04(N) AS (SELECT 1 FROM E02 a, E02 b),
        E08(N) AS (SELECT 1 FROM E04 a, E04 b),
        E16(N) AS (SELECT 1 FROM E08 a, E08 b),
        E32(N) AS (SELECT 1 FROM E16 a, E16 b),
   cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
 SELECT N
   INTO #Tally4
   FROM cteTally
  WHERE N <= 1000000;
GO


Got to do some tests wrapping things in functions and using different scenarios to fill a table.

Options

Mikael's proposed solution can be improved by removing the WHERE clause WHERE N <= @count and adding a TOP clause TOP (@count). That reduces it's execution time almost six-fold for me in the following:

This is the fastest algorithm! Now you know it :o)

create function dbo.Numbers2(@count bigint)
RETURNS TABLE RETURN
--===== Itzik's CROSS JOINED CTE method
   WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
        E02(N) AS (SELECT 1 FROM E00 a, E00 b),
        E04(N) AS (SELECT 1 FROM E02 a, E02 b),
        E08(N) AS (SELECT 1 FROM E04 a, E04 b),
        E16(N) AS (SELECT 1 FROM E08 a, E08 b),
        E32(N) AS (SELECT 1 FROM E16 a, E16 b),
   cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
 SELECT TOP (@count) N
   FROM cteTally

Compared to the slower adaptation:

create function dbo.Numbers3(@count bigint)
RETURNS TABLE RETURN
--===== Itzik's CROSS JOINED CTE method
   WITH E00(N) AS (SELECT 1 UNION ALL SELECT 1),
        E02(N) AS (SELECT 1 FROM E00 a, E00 b),
        E04(N) AS (SELECT 1 FROM E02 a, E02 b),
        E08(N) AS (SELECT 1 FROM E04 a, E04 b),
        E16(N) AS (SELECT 1 FROM E08 a, E08 b),
        E32(N) AS (SELECT 1 FROM E16 a, E16 b),
   cteTally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N) FROM E32)
 SELECT N
   FROM cteTally
   WHERE N <= @count

The solution proposed in https://stackoverflow.com/a/14386663/481812 has similar performance to Numbers2:

create function dbo.Numbers(@count bigint)
RETURNS TABLE RETURN
with byte (n) as (
  -- shortened, see the other answer for full code

Tests

  1. Inserting into a temp heap is fastest (600),
  2. inserting into temp table with PK clustered index slower (3000) and finally
  3. filling a physical table with index the slowest (10000). Additional comparison is made with
  4. filling a temp table with identity PK, as Gregory suggested, from the same CTE style rows generation is pretty much the same (3000) as clustered temp #2 in performance.
  5. Gregory's proposal was too slow to consider (21000), though. To amaze you,
  6. the original example would not be realistic to measure with a million rows; 10000 was used instead, which is 100 times less than in the other tests, to bring it down to reasonable time (15000). Assuming it would scale proportionally - best case scenario - it should theoretically complete 1m rows in 25 minutes (1500000).

Ben's suggestion (tried) would probably make a difference if data were not sorted - here we're inserting sorted data, hence always appending, no performance gain building the index after.

Test group 1 - temp heap inserts

CREATE TABLE #Numbers   ([Number] [int]) 
insert #Numbers select n from dbo.Numbers(1000000)
drop table #Numbers
-- 480 - 900 ms, avg 600 ms

CREATE TABLE #Numbers   ([Number] [int]) 
insert #Numbers select n from dbo.Numbers2(1000000)
drop table #Numbers
-- 440 - 800 ms, avg 550 ms

    -- just in case you wondered how it scales, 30 times more went in in 14000, 
    -- Numbers and Numbers2 scale linearly

CREATE TABLE #Numbers   ([Number] [int]) 
insert #Numbers select n from dbo.Numbers3(1000000)
drop table #Numbers
-- 2700 - 4000 ms, avg 3200 ms

Test group 2 - inserting in temp cluster

CREATE TABLE #Numbers   ([Number] [int]  primary key) 
insert #Numbers select n from dbo.Numbers(1000000)
drop table #Numbers
-- 1900 - 4000 ms, avg 3000 ms

CREATE TABLE #Numbers   ([Number] [int]  primary key) 
insert #Numbers select n from dbo.Numbers2(1000000)
drop table #Numbers
-- 1900 - 4000 ms, avg 3000 ms

CREATE TABLE #Numbers   ([Number] [int]  primary key) 
insert #Numbers select n from dbo.Numbers3(1000000)
drop table #Numbers
-- 4000 - 7000 ms, avg 5000 ms

Test group 3 - inserting in physical table cluster

CREATE TABLE PNumbers   ([Number] [int]  primary key) 
insert PNumbers select n from dbo.Numbers(1000000)
drop table PNumbers
-- 7000 - 12000 ms, avg 10000 ms

CREATE TABLE PNumbers   ([Number] [int]  primary key) 
insert PNumbers select n from dbo.Numbers2(1000000)
drop table PNumbers
-- 7000 - 12000 ms, avg 10000 ms

CREATE TABLE PNumbers   ([Number] [int]  primary key) 
insert PNumbers select n from dbo.Numbers3(1000000)
drop table PNumbers
-- 7000 - 12000 ms, avg 10000 ms

Test 4 - filling temp clustered identity column from rows generated

CREATE TABLE #Numbers ([Number] [int] identity primary key, x bit) 
INSERT INTO #Numbers(x) select 1 from dbo.Numbers2(1000000)
drop table #Numbers
-- 2000 - 4000 ms, avg 3000 ms

Test 5 - filling temp clustered identity column in a loop

CREATE TABLE #Numbers ([Number] [int] identity primary key) 
declare @cnt int = 0
while (@cnt<1000000) 
BEGIN
    INSERT INTO #Numbers DEFAULT values
    set @cnt = @cnt+1
END
drop table #Numbers
-- 20000 - 22000 ms, avg 21000 ms

Conclusions

The Numbers(n) and Numbers2(n) options perform consistently the same (best). Numbers3(n) is a way slower due to the WHERE clause. Other tested scenarios are too slow to consider.

Inserting into an unindexed temp heap is fastest. As soon as you add any index (tried) to the column (either PK clustered or secondary non unique), it's not about the number generation anymore but rather about index inserts (and possibly IO). The worst is inserting into a physical table. This must be because temp tables are not necessarily written to disk (tempdb).

Proposal

According to measurements, you'll gain by

  1. using CTE + JOIN + ROW_NUMBER + TOP based solution as opposed to loops or recursive CTE,
  2. not using an index - using a heap, and
  3. by using a temporary table rather than physical table.

You may avoid the physical or temp table altogether and just use the function, but then in your subsequent queries you'll be chancing the optimizer as you'll be flying without statistics on the numbers. You may then solve this with query hints which is not nice, but works... Though, looking at the execution plan, row numbers at least for dbo.Numbers are estimated correctly.


One simple thing you can do is delay the creation of the index until after all the numbers are populated. In other words, remove it from CREATE TABLE and add a CREATE INDEX statement after your loop. This will prevent the index from being updated on each insert.


I am not sure I agree with the architecture, from a high level, but since you are merely sticking a single field, with numbers from 0 to 9999, you can alter the table definition so it is an auto-increment seeded to start at 0. It is then a matter of inserting the default value into the table. That should be extremely fast.

You can also prototype the table with the proper rows and then insert the 10,000 by "copying over" the rows. This means another table that does "nothing", which may or may not solve the problem.

Just two thoughts.

0

精彩评论

暂无评论...
验证码 换一张
取 消