开发者

Select N random records per group

开发者 https://www.devze.com 2023-02-09 09:05 出处:网络
Hallo and good sunday to everybody. I need to select N random records from each group. Starting from the query of Quassnoi

Hallo and good sunday to everybody. I need to select N random records from each group.

Starting from the query of Quassnoi

http://explainextended.com/2009/03/01/selecting-random-rows/

to select X random record I wrote this store procedure

delimiter //
drop procedure if exists casualiPerGruppo //
create procedure casualiPerGruppo(in tabella varchar(50),in campo varchar(50),in numPerGruppo int)
comment 'Selezione di N record casuali per gruppo'
begin
declare elenco_campi varchar(255);
declare valore int;
declare finite int default 0;
declare query1 varchar(250);
declare query2 varchar(250);
declare query3 varchar(250);
declare query4 varchar(250);
declare cur_gruppi cursor for select gruppo from tmp_view;
declare continue handler for not found set finite = 1;

drop table if exists tmp_casuali;
set @query1 = concat('create temporary table tmp_casuali like ', tabella);
prepare stmt from @query1;
execute stmt;
deallocate prepare stmt;

set @query2 = concat('create or replace view tmp_view as select ',campo,' as gruppo from ',tabella,' group by ',campo);
prepare stmt from @query2;
execute stmt;
deallocate prepare stmt;

open cur_gruppi;
mio_loop:loop
fetch cur_gruppi into valore;
    if finite = 1 then
        leave mio_loop;
    end if;

set @query3 = concat("select group_concat(column_name) into @elenco_campi
              from information_schema.columns
                      where table_name = '",tabella,"'开发者_Go百科 and table_schema = database()");
prepare stmt from @query3;
execute stmt;
deallocate prepare stmt;

set @query4 = concat('insert into tmp_casuali select ',
             @elenco_campi,' from (
                     select  @cnt := count(*) + 1,
                     @lim :=', numPerGruppo,
                         ' from ',tabella,
                     ' where ',campo,' = ', valore,
                     ' ) vars
                     straight_join
                    (
                    select  r.*,
                    @lim := @lim - 1
                    from ', tabella, ' r
                    where   (@cnt := @cnt - 1)
                    and rand() < @lim / @cnt and ', campo, ' = ', valore ,
                    ') i');

prepare stmt from @query4;
execute stmt;
deallocate prepare stmt;

end loop;
close cur_gruppi;
select * from tmp_casuali;
end //
delimiter ;

that I recall in this way to give you an idea:

create table prova (
id int not null auto_increment primary key,
id_gruppo int,
altro varchar(10)
) engine = myisam;


insert into prova (id_gruppo,altro) values 
(1,'aaa'),(2,'bbb'),(3,'ccc'),(1,'ddd'),(1,'eee'),(2,'fff'),
(2,'ggg'),(2,'hhh'),(3,'iii'),(3,'jjj'),(3,'kkk'),(1,'lll'),(4,'mmm');

call casualiPerGruppo('prova','id_gruppo',2);

My problem is that Quassnoi query, even though is very performant, it takes even 1 second on a large recorset. So if I apply it within my sp several times, the total time increases a lot.

Can you suggest me a better way to solve my problem? Thanks in advance

EDIT.

create table `prova` (
  `id` int(11) not null auto_increment,
  `id_gruppo` int(11) default null,
  `prog` int(11) default null,
  primary key (`id`)
) engine=myisam charset=latin1;

delimiter //
drop procedure if exists inserisci //
create procedure inserisci(in quanti int)
begin
declare i int default 0;
while i < quanti do
insert into prova (id_gruppo,prog) values (
                        (floor(1 + (rand() * 100))),
                        (floor(1 + (rand() * 30)))
                       );
set i = i + 1;
end while;
end //

delimiter ;

call inserisci(1000000);

@Clodoaldo: My stored procedure

call casualipergruppo('prova','id_gruppo',2);

gives me 200 records and takes about 23 seconds. Your stored procedure keeps on giving me Error Code : 1473 Too high level of nesting for select even though I increase varchar value to 20000. I don't know if there is any limit on unions involved in a query.


I removed the tabella and campo parameters from the procedure just to make it easier to understand. I'm sure you can bring them back.

delimiter //
drop procedure if exists casualiPerGruppo //
create procedure casualiPerGruppo(in numPerGruppo int)
begin
declare valore int;
declare finite int default 0;
declare query_part varchar(200);
declare query_union varchar(2000);
declare cur_gruppi cursor for select distinct id_gruppo from prova;
declare continue handler for not found set finite = 1;

create temporary table resultset (id int, id_gruppo int, altro varchar(10));

set @query_part = 'select id, id_gruppo, altro from (select id, id_gruppo, altro from prova where id_gruppo = @id_gruppo order by rand() limit @numPerGruppo) ss@id_gruppo';
set @query_part = replace(@query_part, '@numPerGruppo', numPerGruppo);
set @query_union = '';

open cur_gruppi;
mio_loop:loop
fetch cur_gruppi into valore;
    if finite = 1 then
        leave mio_loop;
    end if;

set @query_union = concat(@query_union, concat(' union ', @query_part));
set @query_union = replace(@query_union, '@id_gruppo', valore);

end loop;
close cur_gruppi;

set @query_union = substr(@query_union, 8);
set @query_union = concat('insert into resultset ', @query_union);

prepare stmt from @query_union;
execute stmt;
deallocate prepare stmt;
select * from resultset order by id_gruppo, altro;
drop table resultset;

end //
delimiter ;


Wow. That's a complicated way to do something very simple. Try this:

Assuming you have sequential ids (otherwise you could get no rows).

create view random_prova as
select * from prova
where id = (select min(id) from prova) + 
    floor(RAND(0) * (select max(id) - min(id) from prova));

This will give you 1 random row.

To get multiple rows, either loop in a stored procedure or server program until you get enough rows, or programatically create a query that employs union. eg, this will give you 3 random rows:

select * from random_prova
union
select * from random_prova
union
select * from random_prova;

Note that using RAND(0) instead of RAND() means getting a different random number for each invocation. RAND() will give the same value for each invocation in the one statement (so using RAND() with union won't give you multiple rows).

There are some shortcomings with using union - it is possible to get the same row twice by chance. Programatically calling this until you get enough rows is safer.

To give better performance, use something like java to randomly select the ids for a simple query like

select * from prova where id in (...)

and have java (or perl or whatever) fill in the list with random ids - you would avoid the inefficiency of having to get the id range every time.

Post if your ids are not sequential - there is an efficient way, but I its explanation is long.

0

精彩评论

暂无评论...
验证码 换一张
取 消