Is there a good way to do this type of mining?_问答_开发者

I am trying to find points that are closest in space in X and Y directions (sample dataset given at the end) and am looking to see if there are smarter approaches to do this than my trivial (and untested) approach. The开发者_运维百科 plot of these points in space looks something like the following and am trying to find sets of points marked inside the boxes i.e. the output I am looking for is a set of groups:

Group 1: (1,23), (2,23), (3,23)...
Group 2: (68,200), (68,201), (68,203), (68,204), (68,100), (68,101), (68,101)...

Is there a good way to do this type of mining?

For the horizontal bands, I am thinking I could just go ahead and use small sliding windows of size say, 5 or 10 (which should really be determined from the global information of which size will give the maximum grouped points but I am still exploring a good approach) and search for contiguous points because a break would not be considered a horizontal band anymore.

I am guessing the same approach works for the vertical bands as well but not in all cases because there is a subtle difference in horizontal and vertical bands: points should appear close to be considered a group horizontally but they can appear anywhere to be considered part of a vertical band. Observe the large vertical band in the figure. So I am guessing I could just look for points that have the same x-coordinate (in this case, x=68) should give me a lot of points.

Other than this trivial solution, I can't think of anything smart that can be done here as this problem appears deceptively simple to me. Am I missing something here? Does this fall into some known class of problems and if so, is there a good and scalable approach to achieve this?

Sample Dataset:

This is a little late, but this problem has been worrying me for some time. I was sure it could be solved with mixed integer / linear programming techniques and asked for help in this question: Identifying column and row clusters with linear programming

However, after getting a reply there, I had the insight that your problem, at least as I understand it, is so simple (when framed as a constraint program) that you can solve it trivially with a simple program (which you already knew). In other words, constraint programming would be a cool way to solve this, but, at least with the approach I found, would give you the same answer as something much simpler.

I'll explain below my reasoning, how I would implement it with a constraint solving package, and then give the final, trivial, algorithm.

Mixed integer programming solution

The most important detail is the difference between horizontal and vertical groups. As far as i can see, anything that aligns vertically can be in the same group. But horizontal groups are different - components have to be close together.

The hardest part of solving a problem with constraints seems to be finding a way to describe the limits in a way that the solver can understand. I won't go into the details here, but solvers are frustratingly limited. Luckily I think there is a way to do this here, and it is to consider horizontal neighbours: if there are N points in a row then we have N-1 sets of neighbours (for example, with 4 points A B C and D there are the three pairs AB, BC, and CD).

For each pair, we can give a score, which is the number of spaces between them (S_i) scaled by some factor K, and a flag (F_i) which is 0 or 1. If the pair are in the same horizontal group then we set the flag to 1, otherwise it is zero.

It is critical to see that the set of flags for all the pairs completely defines a solution. We can run across any row, placing pairs with a flag of 1 in the same horizontal group, and starting a new horizontal group each time the flag is 0. Then, we can take all horizontal groups of size 1 and convert them into vertical groups: any point that is not in a horizontal group must be in a vertical group (even if it is a vertical group of just one).

So all we need now is a way to express an optimal solution in terms of the flags. I suggest that we want to minimise:

sum(1 - F_i) + sum(K * S_i * F_i)

This has two terms. The first is the sum of "one minus the flag" for each pair. The flag is 1 when the points are in the same horizontal group and 0 otherwise. So minimising this value is the same as saying that we want as few horizontal groups as possible. If this was the only constraint then we could set it to zero by making all the F_i 1 - by making all pairs on a row members of the same group.

But the second term stops us from choosing such an extreme solution. It penalises groups with gaps. If a pair are in the same group, but are separated by S_i spaces, then we have a "penalty" of K * S_i.

So we have a trade-off. We want horizontal groups, but we don't want gaps. The final solution will depend on K - if it is large then we won't include any spaces in horizontal groups. But as it is decreased we will start to do so, until when it is very small (tends to zero) we place everything in a row in a single group.

To use this you would choose some K, calculate the S_i, and enter the expression above into a constraint system. The system would then choose F_i to minimise the expression. Finally you would convert the F_i into a pattern of groups by scanning each row as described above and then grouping singletons vertically.

Analytic solution

OK, cool. At this point we have a way to express the problem that we can give to a constraint engine.

But it's trivial to solve! We don't need no stinkin' constraint engine to solve this - we can just look at the expression:

sum(1 - F_i) + sum(K * S_i * F_i)

The two sums are over the same pairs, so we can move everything into the sum:

sum(1 - F_i + K * S_i * F_i)
sum(1 + F_i * (K * S_i - 1))

And then extract the constant (N here is the total number of pairs):

N + sum(F_i * (K * S_i - 1))

Now note that each term in the sum is independent (and additive). So for each term, we want the minimum value. We have two options:

if F_i is 0 then the entire term is 0.
otherwise, F_i is 1 and the term is K * S_i - 1.

So the best choice depends on whether K * S_i is greater than 1. If K * S_i is greater than 1 then the smallest value of the term is 0, and F_i should be 0. Otherwise the second choice above is negative, and F_i should be one.

Trivial algorithm

What does this mean? It means that for each pair we can simply look at the number of spaces, S_i. If that is greater than 1 / K then the two points should be in separate groups. Otherwise they should be in the same group.

So all this fancy maths and optimisation and constraints and bullshitting comes down to: how far apart are two points in neighbouring pairs? If they are closer than some cut-off, put them in the same horizontal group. Otherwise, put them in separate groups.

So here, finally, is your algorithm:

choose some cut-off value, X
place each point in its own, singleton, horizontal group
for each row with more than one point:
    for each neighbouring pair in the row:
        if the space between the pair is less than X:
            join into a single horizontal group
for each column:
    join any remaining singleton groups into a single vertical group

Conclusion

You can use constraint programming techniques to solve this problem, but such techniques are restricted to solutions that describe the system in the "correct" (typically, linear) way.
The simplest such approach I can find is equivalent to a trivial, direct algorithm that divides points in a row into horizontal groups depending on the number of spaces between them.
This all depends on a whole pile of assumptions about what you wanted which may, of course, be over-simplifications, or just plain wrong.

You can try using cluster module. It contains implementation of K-means clustering algorithm. You can adjust argument of getclusters function to change the number of clusters you want.

s = '''
1,23
1,23
2,23
...
68,101
68,103
68,104
'''

from cluster import *

ll = [tuple(map(int,each.split(','))) for each in s.split()]

#horizontal 
cl = HierarchicalClustering(ll, lambda x,y: abs(x[0]-y[0]))

for c in cl.getlevel(1):
    print c

#vertical
cl = HierarchicalClustering(ll, lambda x,y: abs(x[1]-y[1]))

for c in cl.getlevel(1):
    print c