SqlServer Random Data Generation Observation_问答_开发者

I have a question on why the output of these two queries differ. I would have expected them to work the same.

Query 1:

declare @cache table(originalValue nvarchar(255), obfuscateValue nvarchar(255));

declare @table1 table(c char(1));
declare @i1 int;
set @i1 = ASCII('0');

while @i1 <= ASCII('9')
begin
    insert into @table1 (c)
    select (CHAR(@i1))    

    set @i1 = @i1 +1;
end


insert into @cache (originalValue, obfuscateValue)
select [firstname], 
        (select top 1 c from @table1 order by NEWID()) + 
        (select top 1 c from @table1 order by NEWID()) 
from Customer
where [firstname] is not null

select * from @cache;

Query 2:

declare @cache table(originalValue nvarchar(255), obfuscateValue nvarchar(255));

declare @table1 table(c char(1));
declare @i1 int;
set @i1 = ASCII('0');

while @i1 <= ASCII('9')
begin
    insert into @table1 (c)
    select (CHAR(@i1))    

    set @i1 = @i1 +1;
end


insert into @cache (originalValue)
select [firstname]
from Customer
where [firstname] is not null

update c
set c.obfuscateValue = t.Value
from @cache c
join 
(
    select originalValue,
    (       
        (select top 1 c from @table1 order by NEWID()) + 
        (select top 1 c from @table1 order by NEWID()) 
    ) as Value
    from @cache
) t on t.originalValue = c.originalValue

select * from @cache;

They should do the same, but first query returns following results:

Jonathon    73
Everett 73
Janet   73
Andy    73
Shauna  73

And second:

Jonathon    82
Evere开发者_JS百科tt 40
Janet   68
Andy    79
Shauna  29

As you noticed, the second column in second result has different values, while first - same values.

It looks like in first query the

(select top 1 c from @table1 order by NEWID()) + 
        (select top 1 c from @table1 order by NEWID())

is called only once.

Can someone explain this mystery?

I think random values can be generated in another way.

This is how to generated [a-zA-Z]{3,6}

declare @min int, @max int;
declare @alpha varchar(max)

set @min = 3;
set @max = 6;
set @alpha = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'

declare @cache table(originalValue nvarchar(255), obfuscateValue nvarchar(255));

insert into @cache (originalValue, obfuscateValue)
select [firstname], LEFT(t.Value, case when t.maxLen < @min then @min else t.maxLen end)
from Customer 
join
(
    select ABS(CHECKSUM(NEWID()))%@max + 1 as maxLen,
            SUBSTRING(@alpha, ABS(CHECKSUM(NEWID()))%LEN(@alpha) + 1, 1) +
            SUBSTRING(@alpha, ABS(CHECKSUM(NEWID()))%LEN(@alpha) + 1, 1) +
            SUBSTRING(@alpha, ABS(CHECKSUM(NEWID()))%LEN(@alpha) + 1, 1) +
            SUBSTRING(@alpha, ABS(CHECKSUM(NEWID()))%LEN(@alpha) + 1, 1) +
            SUBSTRING(@alpha, ABS(CHECKSUM(NEWID()))%LEN(@alpha) + 1, 1) +
            SUBSTRING(@alpha, ABS(CHECKSUM(NEWID()))%LEN(@alpha) + 1, 1) as Value
)t on t.Value is not null
where [firstname] is not null

select * from @cache;

One line?

SELECT
     RIGHT( --number of zeros to match expected max length. Or use REPLICATE.
        '000000' + CAST(
          --The 2 newid() expression means we'll get a larger number
          --less chance of using leading static zeroes
          CAST(CHECKSUM(NEWD_ID()) as bigint) * CAST(CHECKSUM(NEWD_ID()) as bigint)
            as varchar(30))
        --The 3 gives us the desired mask. Currently 3 digits.
        , 3)

You are correct in your assumption that the first query is only running the ‘select top’ once. The behavior is happening because of how the optimizer chose to optimize the query. It decided because the subqueries (the select top queries) are self-contained and are not correlated with the outside select query it uses a Tablespool (Lazy Spool) operator in the execution plan. This causes the select top value to be placed in the tempdb for reuse.

Because the optimizer chooses to use a Nested Loops operator to bring all the data together no rebinding is needed, the spooled value is used instead of reapplying the queries for each input outer row.

During the second query the optimizer chose not to use a Tablespool operator (I believe do to fact the input table being from tempdb). So you have the select top subqueries being reapplied for each input row from the temporary table.

If needed, you may be able to use a table/query hints if you want to force the execution plan to perform as desired.