Suppose I have a 2D array like the following:
GACTG
AGATA
TCCGA
Each array element is taken from a small finite set (in my case, DNA nucleotides -- {A, C, G, T}
). I would like to randomly shuffle this array somehow while preserving both row and column nucleotide frequencies. Is this possible? Can it be done efficiently?
[EDIT]: By this I mean I want to produce a new matrix where each row has the same number 开发者_StackOverflow中文版of A
s, C
s, G
s and T
s as the corresponding row of the original matrix, and where each column has the same number of A
s, C
s, G
s and T
s as the corresponding column of the original matrix. Permuting the rows or columns of the original matrix will not achieve this in general. (E.g. for the example above, the top row has 2 G
s, and 1 each of A
, C
and T
; if this row was swapped with row 2, the top row in the resulting matrix would have 3 A
s, 1 G
and 1 T
.)
It's simple enough to preserve just column frequencies by shuffling a column at a time, and likewise for rows. But doing this will in general alter the frequencies of the other kind.
My thoughts so far: If it's possible to pick 2 rows and 2 columns so that the 4 elements at the corners of this rectangle have the pattern
XY
YX
for some pair of distinct elements X
and Y
, then replacing these 4 elements with
YX
XY
will maintain both row and column frequencies. In the example at the top, this can be done for (at least) rows 1 and 2 and columns 2 and 5 (whose corners give the 2x2 matrix AG;GA
), and for rows 1 and 3 and columns 1 and 4 (whose corners give GT;TG
). Clearly this could be repeated a number of times to produce some level of randomisation.
Generalising a bit, any "subrectangle" induced by a subset of rows and a subset of columns, in which the frequencies of all rows are the same and the frequencies of all columns are the same, can have both its rows and columns permuted to produce a valid complete rectangle. (Of these, only those subrectangles in which at least 1 element is changed are actually interesting.) Big questions:
- Are all valid complete matrices reachable by a series of such "subrectangle rearrangements"? I suspect the answer is yes.
- Are all valid subrectangle rearrangements decomposable into a series of 2x2 swaps? [EDIT]: mhum's counterexample shows that the answer is no. Unfortunate, because this would seem to make it harder to come up with an efficient algorithm, but important to know.
- Can some or all of the valid rearrangements be computed efficiently?
This question addresses a special case in which the set of possible elements is {0, 1}
. The solutions people have come up with there are similar to what I have come up with myself, and are probably usable, but not ideal as they require an arbitrary amount of backtracking to work correctly. Also I'm concerned that only 2x2 swaps are considered.
Finally, I would ideally like a solution that can be proven to select a matrix uniformly at random from the set of all matrices having identical row frequencies and column frequencies to the original. I know, I'm asking for a lot :)
Edit: oops missed the last paragraph of OP's question, let me rephrase.
To digress briefly, the question you linked to had quite a hilarious discussion about the "level" of randomness for the selected solution, allow me to paraphrase:
"...I really require matrices that are as random as possible..."
"...The algorithm, as implemented in the code, is quite random..."
"...if you choose this method, a different way to improve the randomness is to repeat the randomization process several times (a random number of times)..."
None of these comments make any sort of sense, there is no such thing as "more" random, this is all exactly like this lovely Daily WTF entry. That said, the last quote is almost onto something. It's well known that if you simulate a Markov chain, like that random swapping algorithm, for long enough you will eventually start generating samples from the steady state distribution. Just exactly what that distribution looks like, who knows...
Anyway, depending on your objectives you may not really care what this distribution looks like as long as it contains enough elements. So some sort of swapping algorithm might be useful, but I really would not expect this to be easy since the problem is NP-Complete (more general than Sudoku).
With that in mind, you could consider solving your problem any approach that works for solving Sudoku, if you are in Acadamia I would suggest getting a copy of IBM CPLEX 12 which is free for academic use. You can code up a Sudoku-like solver in their CP language (OPL) and as the integer linear program solver to generate solutions for you. I think they even have example code for solving Sudoku you can borrow from.
Here's the only truly random and unbiased way I can think of to sample from such matrices: First get CPLEX to find all N solutions to the given Sudoku problem. After you have this set of N solutions, draw a random number between 1 and N and use that solution, if you want another one, draw another number. Since generating all solutions might be a bit slow, you could approximate something like this by telling the solver to stop after a certain number of solutions or time elapsed and only sample from that set.
The answer to question 2 is no. Consider the following 2 matrices:
A B C C A B
C A B B C A
B C A A B C
They clearly have the same row and column frequencies. Yet, there is no 2x2 submatrix with common corners.
No clue, but what you are talking about is basically a generalized sudoku solver. Try http://scholar.google.com/scholar?q=sudoku
It turns out that for 0-1 matrices, 2x2 swaps are sufficient to get from one matrix to any other. This was proved by H J Ryser as Theorem 3.1 in a paper called "Combinatorial Properties of Matrices of Zeros and Ones": http://cms.math.ca/cjm/v9/cjm1957v09.0371-0377.pdf . People have been trying to prove for a while that the Markov chain based on 2x2 swaps mixes rapidly; this paper http://arxiv.org/pdf/1004.2612v3 seems to come the closest.
If one could prove the generalization of Ryser's theorem to your case (maybe with up to 4x4 "swaps"), then on account of the symmetry of the swaps, it wouldn't be too hard to get a chain whose steady state distribution is uniform on the matrices of interest. I don't think there's any hope at the moment of proving that it mixes rapidly for all possible row/column distributions, but perhaps you know something about the distributions that we don't...
精彩评论