开发者

How to generate records and spread them among pairs from a table?

开发者 https://www.devze.com 2023-03-10 19:33 出处:网络
I have to generate about a million random trips between about 40K destinations. Each destination has it\'s own weight (total_probability), the more it is, the more trips should start or end in this pl

I have to generate about a million random trips between about 40K destinations. Each destination has it's own weight (total_probability), the more it is, the more trips should start or end in this place.

Either the trips should be generated randomly, but destinations (start and end points) should be weighted by probability, or it's possible to just pre-calculate an e开发者_Python百科xact number of trips (divide each weight by the sum of weights, multiply by 1M and round to integers).

Problem is how to make it in PostgreSQL without generating the 40K*40K table with all destinations pairs.

          Table "public.dests"
   Column          |       Type       | Modifiers 
-------------------+------------------+-----------
 id                | integer          | 
 total_probability | double precision | 

          Table "public.trips"
   Column   |       Type       | Modifiers 
------------+------------------+-----------
 from_id    | integer          | 
 to_id      | integer          | 
 trips_num  | integer          | 
 ...
 some other metrics...

primary key for trips is (from_id, to_id) Should I generate a table with 1M records and then update it iteratively, or a for loop with 1M inserts will be fast enough? I work on a 2-core lightweight laptop.

P.S. I gave up and did this in Python. To perform a set of queries and the transformation in Python, I'll run SQL scripts from Python rather than from a shell script. Thanks for suggestions!


In 9.1, you can use TRIGGERs on VIEWs, which effectively let you create materialized views (albeit manually). I think your first run may be expensive, but using a loop is probably the way to go, but then after that, I'd use a series of TRIGGERs to maintain the data in a table.

At the end of the day you need to decide whether or not you want to calculate the results for every query, or you memoize the result via a materialized view.


I'm confused by your requirement but I guess this can get you started:

select 
    f.id as "from", t.id as to, 
    f.total_prob as from_prob, t.total_prob as to_prob
from 
    (
        select id, total_prob
        from dest
        order by random()
        limit 1010
    ) f
    inner join
    (
        select id, total_prob
        from dest
        order by random()
        limit 1010
    ) t on f.i != t.i
order by random()
limit 1000000
;

EDIT:

This took about ten minutes in my not that modern desktop:

create table trips (from_id integer, to_id integer, trip_prob double precision);

insert into trips (from_id, to_id, trip_prob)
select 
    f.id, t.id, f.total_prob * t.total_prob
from 
    (
        select id, total_prob
        from dests
    ) f
    inner join
    (
        select id, total_prob
        from dests
    ) t on f.id != t.id
where random() <= f.total_prob * t.total_prob
order by random()
limit 1000000
;

alter table trips add primary key (from_id, to_id);

select * from trips limit 5;
 from_id | to_id |     trip_prob      
---------+-------+--------------------
       1 |     6 | 0.0728749980226821
       1 |    11 |  0.239824750923743
       1 |    14 |  0.235899211677577
       1 |    15 |  0.176168172647811
       1 |    17 |   0.19708509944588
(5 rows)
0

精彩评论

暂无评论...
验证码 换一张
取 消