Sorry for the long question. I decided to explain the context of the problem first as maybe there are other solutions to my problem. If you're in a hurry, just read THE QUESTION below.
(EDITED -- In the mean time I added some attempts to solve the problem. The fourth one has my final conclusion, you can jump straight to it.)
THE CONTEXT
I have a hash table filled with roughly 20k pairs (key(i),value(i)). I want to generate random lists like this one
[(key(213),value(213));(key(127),value(127));(key(89),value(89));...]
The restriction is that once I've chosen key(213) to be the first element of the list, not all keys can follow it (I have some other function 'decide' which can decide whether some key can be the next one in the list or not). So, I'd like to choose a random next element and check if it's appropriate -- in the example above key(127) was chosen. In the case that that element is rejected by my 'decide' function I'd like to pick another one randomly. But I wouldn't want to pick the same just rejected because I know it will be rejected again and not only would this be 开发者_高级运维inefficient, I also run the risk, if only a few keys can be the next ones, of taking a long time until I find an appropriate key. Note that there can be repetition, such as
[(key(213),value(213));(key(213),value(213));(key(78),value(78));...]
This is OK, as long as the 'decide' function accepts key(213) as the next in the list. So, all I need is a way to randomly enumerate the (key,value) pairs in the hash table. Whenever I have to choose a key, I create an enumeration, which I consume by checking each new element with the 'decide' function (so, no repetitions occur) and when I find one I add it to the list and proceed incrementing the list. The thing is that I don't want this enumeration of the hash table to be the same every time. I want it to be random. (This has to do with the structure of the search space I have in my particular problem which is not relevant here.)
I can of course implement this by generating random integers and using just lists -- that's what I'm currently doing. But, as this is something I run into quite frequently, I wonder if there is some random enumeration facility for hash tables somewhere.
THE QUESTION
Is there some random enumeration function for hash tables somewhere? I am aware of the function BatHashtbl.enum (Batteries library) but I think it will always give me the same enumeration for the same hash table (is this correct?). Also, there doesn't seem to exist anything of the sort in that BatHashtbl module. I'd be interested in something like
random_enum: ('a, 'b) t -> int -> ('a * 'b) Enum.t
which, when provided with a hash table and some integer as a seed for the random generator, will give a different random enumeration of the hash table. Any ideas?
Thanks for any help!
Best, Surikator.
FIRST ATTEMPT
After Niki's suggestion in the comments, and looking in more detail through the Batteries Library, I've come up with this
let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte (* This returns*)
in Array.to_list s
which has the type
val rand_enum : ('a,'b) BatHashtbl.t -> int -> ('a*'b) list
It uses the Fisher-Yates algorithm for the shuffling which runs in O(n). It returns a list instead of an enumeration and that is quite annoying, because it means that even if I'm happy with the third element of the list obtained with rand_enum, the function will still compute a random enumeration for the whole 20k elements in the hash table.
Best, Surikator
SECOND ATTEMPT
I defined the module RndHashtblEnum as
(* Random Hashtable Enumeration Module *)
type ('a,'b) t = {
ht:('a,'b) BatHashtbl.t;
mutable ls:('a*'b) list;
f: (('a,'b) BatHashtbl.t -> ('a*'b) list)}
let shuffle ht =
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte
in Array.to_list s
let create ht n = (BatRandom.init n; {ht=ht;ls=shuffle ht;f=shuffle})
let rec next re =
match re.ls with
| [] -> re.ls<-(re.f re.ht);next re
| h::t -> re.ls<-t; h
It has the new type t for random enumerations of hash tables. This type stores the hash table we wish to enumerate, the list we will be enumerating from and a function to compute a new enumerated list (from the hash table) once the list we have runs out. Once the list has run out, when we ask for a new random element of the hash table, the type t automatically puts a new random list created from the hash table.
So, using the above module, if we want to enumerate a hash table randomly, we simply do:
let re = RndHashtblEnum.create ht 1236
to create a random enumeration of hash table ht with random seed 1236 (In this code I assume the hash table was defined before) and we can then write
let (k,v) = RndHashtblEnum.next re
to get the next (k,v) pair from the random enumeration.
One question we may ask is whether this is actually fair randomness because I use the remaining of the list to randomly enumerate the hash table the next time I need a random enumeration. Well, it isn't. If my hash table has say 1000 elements and after extracting 5 random ones I am satisfied with the result, I know that in the next 995 (of the second set of extractions) none of those 5 elements will be extracted. So, that's not fair randomness. It's even worse. It may well be the case that in the next 1000 extractions (995 from this list, 5 from the next enumerating list) some elements will not be covered. On average, the algorithm is fair, but it isn't fair all the time.
Best, Surikator.
THIRD ATTEMPT
Hi again,
Including Niki's suggestion of using BatArray.enum and a fundamental change in the stochastic part of the algorithm, I have come up with a new improved version of the RndHashtblEnum module. The suggestion is:
(* Improved Random Hashtable Enumeration Module *)
type ('a,'b) t = {ht:('a,'b) BatHashtbl.t; mutable enum:('a*'b) BatEnum.t; enum0: ('a*'b) BatEnum.t}
let shuffle ht =
let hte = BatHashtbl.enum ht
in let s = BatRandom.shuffle hte
in BatArray.enum s
let create ht n =
let e = shuffle ht
in (BatRandom.init n; {ht=ht;enum=BatEnum.clone e;enum0=e})
let rec next re =
match BatEnum.get re.enum with
| None -> re.enum<-re.enum0; next re
| Some e -> e
This new module gets rid of the (silly) costs of passing an array to a list and only uses the Fisher-Yates algorithm once at the start -- so, in the long run, we can consider the contribution of the Fisher-Yates bit to be O(1).
The new version is now fair in terms of randomness. This is not so easy to see and it took me a bit to realize that. Suppose the hash table has 1000 entries. In the new version, we always use the same enumeration (enum0 -- fixed when we create the random enumeration with the "create" function). This means that, when trying to find the next element in our final list, since some key in the hash table will have to satisfy the "decide" function (otherwise we would couldn't continue with the algorithm and we would just stop), it will do so somewhere between the 0th and the 999th entry. Suppose it's on entry 300. Now, given we've chosen this key, for deciding the next key in the final list, our enumeration will continue with the remaining 700 elements and then will go on to the next 300 in the copy of the same enumeration. So, the 700+300 make exactly the 1000 in the hash table. This means that we will always consider each entry in the hash table once and only once. The other thing is that each time we attempt to find a key to go in the list that label could be found on entry 300 but also on entry 734 or something else, because the decide function actually depends on which previous keys have been chosen up to that point in the final list. So, each time we start fresh looking for an element for the final list in the hash table we start with a random element of the hash table.
Sorry if this is not very clear. It's hard to explain. =)
Thanks for all the comments.
Best, Surikator.
FOURTH AND FINAL ATTEMPT -- THIS IS MY PROPOSED SOLUTION
Hi yet again,
Sharing gasche's worries about mutable fields and enumerations in general and all the weird side-effects that can come from there, I decided to forget about off-the-shelf solutions using available hash tables libraries and wrote my stuff using plain lists. I also brought laziness to deal with avoiding generating random lists of which only part would be used (so there was useful lazy stuff to be used as you suggested, Niki).
I created the type
type 'a node_t =
| ENil
| ECons of 'a * 'a list * 'a t
and 'a t = ('a node_t) Lazy.t
for lazy random enumerations of lists. Each enumeration is either empty (ENil) or not (ECons) in which case it has three parts to it: (1) the element currently in focus, (2) the rest of available elements to enumerate, (3) another enumeration to continue this enumeration.
Then, a random enumeration of a list can be obtained using the create
function
let rec create ls =
lazy( match ls with
| [] -> ENil
| h::t -> let n = Random.int (List.length ls)
in let newx,rest=remove ls n
in ECons(newx,rest,create t))
where the auxiliary remove
function has been defined to extract the nth element of a list and return a pair (x,ls)
where x
is the extracted element and ls
is the new list without the extracted element. Just for completeness I add the code of the remove
function here too.
let rec remove ls n =
let rec remove_ ls acc k n =
match ls with
| [] -> raise (Failure "remove")
| h::t -> if k=n
then h, List.rev_append acc t
else remove_ t (h::acc) (k+1) n
in remove_ ls [] 0 n
We can now define very simple functions for generating the next state of the random enumeration and for getting the actual element in each state of the enumeration. Those are
exception End_of_enum
let next e =
match Lazy.force e with
| ENil -> raise End_of_enum
| ECons(x,ls,t) -> t
let rec get e =
match Lazy.force e with
| ENil -> raise End_of_enum
| ECons(x,ls,t) -> x
OK, up until now, I've simply enumerated lists randomly. If we want to enumerate a hash table instead, we can use
let rand_enum ht =
let ls = Hashtbl.fold (fun k v acc -> (k, v) :: acc) ht []
in create ls
to obtain a random enumeration of the pairs in the hash table and we can use next and get to obtain the (key,value) pairs. The fold
but is just a way of getting all the (key,value) pairs of the hash table in a list (thanks Pascal for your answer in this question).
This ends the whole hash table enumeration thing. For the sake of completeness, I'm adding also the solution to the overall problem I was trying to solve, explained in "The Context" above. The problem, if you recall, is to randomly generate a list of (key,value) pairs from (1) a hash table, and (2) a decide
function which can tell whether a (key,value) can be appended to some particular list of pairs. Since the whole generation process may never terminate, to ensure termination I thought it would make sense to have a third argument which is a function which tells whether we should stop the process or not (and which we should make sure will return true at some point for the whole process to terminate).
The function generate
could be something like
let generate ht d stop =
let rec gen1 d fst e =
if d (List.rev fst) (get e)
then (get e)::fst
else gen1 d fst (next e)
in let rec generate_ ht d stop acc =
let e = rand_enum ht
in if stop acc
then acc
else try generate_ ht d stop (gen1 d acc e)
with End_of_enum -> generate_ ht d stop (List.tl acc)
in generate_ ht d stop []
Thanks a lot to everyone who contributed with useful comments. This was really helpful.
All the best, Surikator.
I have two suggestions. The first is to change your rand_enum
function so it returns an Enum.t:
let rand_enum ht n =
BatRandom.init n;
let hte = BatHashtbl.enum ht
in Array.enum (BatRandom.shuffle hte)
which isn't terribly different (it's still computing a random enum for all 20k) but is closer to what you originally wanted.
Alternatively, you could always take the source code of HashTbl and recompile it with a rand_enum function. However this also probably won't be that different, as a HashTbl is implemented as an array and if you want to avoid bad duplicates you're probably going to end up using a shuffle.
What is the density of potential next element ? What is the cost of your decide
function ?
All your current solution have an O(n) cost. Fisher-Yates is O(n) (and it does not make much sense to try to adapt it for Enums, as it would require forcing the enumeration anyway), and Array.to_list alos is O(n).
If your decide
function is fast enough and your density low enough, I think it may be simpler to just build a list/array of all eligible elements (calling decide
on each element of the table), then randomly pick one of them.
If the density is high enough and decide
costly, I think your first idea, picking keys at random and keeping a list of already-encountered keys. You will be able to pick the first eligible element encountered (optimal number of decide
calls). This way to enumerate a sequence gets costly "in the end", when all elements have already been picked, but if your density is high you won't run into that case.
If you don't know, it may be interesting to start with the "high density" hypothesis, and change your mind once you've seen a given portion of the table, and still found nothing.
Finally: if you don't need to add/remove elements during the generation of your sequence, it would be interesting to convert your hashtable into an array once and forall (keeping another key -> array index table somewhere), as all such problems are simpler when the indexing is contiguous.
Your implementations )(second and third) are too complicated. I don't like mutable
and I don't like Enum
. Combining them both is the best way to shoot yourself in the foot with uncontrolled side-effects.
I also think your particular problem is too specific to be solved by a generic-looking "shuffle something and that's it" function. Trying to find such a domain-independent function which also solves your domain-specific problem is maybe why your successive implementation get uglier and more complex at each attempt.
Producing a random stream from a Hashtable is simple : BatHashtbl.enum |- BatRandom.shuffle |- BatArray.enum
. The rest of your code should concern the use of the decide
function.
I doubt that there is such function given the interface exposed by Hashtbl. Obvious approach like getting all the values into an array and doing lookups by Array.get a (Random.int (Array.length a))
looks fine to me.
精彩评论