开发者

Assign values to records randomly based on given distributions

开发者 https://www.devze.com 2023-02-17 02:18 出处:网络
I have a t-sql procedure that generates test data and as part of this, several columns need to be set randomly to a set of values following a predefined distribution. I currently have a method for doi

I have a t-sql procedure that generates test data and as part of this, several columns need to be set randomly to a set of values following a predefined distribution. I currently have a method for doing this using an identity column and the remainder operator % but I wondered if th开发者_运维技巧ere was a more elegant way of designing a solution to this problem that would allow me to parameterise the values and distribution that the columns are assigned.

For example,

I want to assign Column A one of three values {Horse, Donkey, Pony} and column B one of two values {Big, Small}. In this case I want the distributions to be equal, {Horse:1/3, Donkey 1/3, Pony, 1/3}, {Big:1/2, Small:1/2} but Column B's distribution should depend on Column A (i.e. 1/2 of the Ponies should be Big)

Record | ColumnA | ColumnB | OtherData
1 | Horse | Big |...
2 | Horse | Small |...
3 | Donkey | Big |...
4 | Donkey | Small |...
5 | Pony | Big |...
6 | Pony | Small |...

I would prefer to have a semi-random allocation of Column A and B with respect to record numbers but this is not essential.

In the case where a distribution is not exactly possible, all remaining records should be assigned one of the possible values. It is assumed that in large data sets this will be work itself out.


Not sure if this will help in your case, but this will work on SQL Server 2005+. This is a set of 300 records, yielding 100 of each animal, and 50 of each size per animal.

There may be a similar way to approach this in other SQL languages.

DECLARE @MaxCount INT
SET @MaxCount = 300

SET NOCOUNT ON;
DECLARE @Numbers TABLE (Number INT NOT NULL IDENTITY);
INSERT @Numbers DEFAULT VALUES;
WHILE SCOPE_IDENTITY() < @MaxCount INSERT @Numbers DEFAULT VALUES;

SELECT  *
FROM    (
    SELECT  RandomSize.Number
    ,   RandomSize.Animal
    ,   CASE RandomSize.RowNumber % 2
            WHEN 0 THEN 'Big'
            WHEN 1 THEN 'Small'
        END Size
    FROM    (
        SELECT  DerivedAnimal.Number
        ,   DerivedAnimal.Animal
        ,   ROW_NUMBER() OVER (PARTITION BY DerivedAnimal.Animal ORDER BY NEWID()) RowNumber
        FROM    (
            SELECT  RandomAnimals.Number
            ,   CASE RandomAnimals.RowNumber % 3
                    WHEN 0 THEN 'Horse'
                    WHEN 1 THEN 'Donkey'
                    WHEN 2 THEN 'Pony'
                END Animal
            FROM    (
                SELECT  Number
                ,   ROW_NUMBER() OVER (ORDER BY NEWID()) RowNumber
                FROM    @Numbers
                ) RandomAnimals
            ) DerivedAnimal
        ) RandomSize
    ) FinalList
ORDER BY FinalList.Number
0

精彩评论

暂无评论...
验证码 换一张
取 消