Algorithm to generate numerical concept hierarchy

https://www.devze.com · 2022-12-25 03:17 · Source: web
I have a couple of numerical datasets that I need to create a concept hierarchy for. For now, I have been doing this manually by observing the data (and a corresponding line chart). Based on my intuition, I created some acceptable hierarchies.

This seems like a task that can be automated. Does anyone know if there is an algorithm to generate a concept hierarchy for numerical data?


To give an example, I have the following dataset:

Bangladesh     521
Brazil         8295
Burma          446
China          3259
Congo          2952
Egypt          2162
Ethiopia       333
France         46037
Germany        44729
India          1017
Indonesia      2239
Iran           4600
Italy          38996
Japan          38457
Mexico         10200
Nigeria        1401
Pakistan       1022
Philippines    1845
Russia         11807
South Africa   5685
Thailand       4116
Turkey         10479
UK             43734
US             47440
Vietnam        1042

for which I created the following hierarchy:

  • LOWEST ( < 1000)
  • LOW (1000 - 2500)
  • MEDIUM (2501 - 7500)
  • HIGH (7501 - 30000)
  • HIGHEST ( > 30000)
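(For reference, the hand-made hierarchy above can be expressed as a simple lookup; the function name is my own choice.)

```python
# Map a value to its level in the five-bucket hierarchy from the question.
def level(value):
    if value < 1000:
        return "LOWEST"
    if value <= 2500:
        return "LOW"
    if value <= 7500:
        return "MEDIUM"
    if value <= 30000:
        return "HIGH"
    return "HIGHEST"

print(level(521))    # LOWEST
print(level(46037))  # HIGHEST
```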


Maybe you need a clustering algorithm?

Quoting from the link:

Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.
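As a concrete sketch (my own, not from the linked article): a minimal 1-D k-means in plain Python, run on the figures from the question with k = 5 to match the five-level hierarchy. It assumes k >= 2 and distinct seed values.

```python
# Minimal 1-D k-means sketch: seed centroids evenly across the sorted
# data, then alternate nearest-centroid assignment and mean updates.
def kmeans_1d(values, k, iters=100):
    data = sorted(values)
    # Seed centroids evenly across the sorted data (assumes k >= 2).
    centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in data:
            # Assign each value to its nearest centroid.
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[idx].append(v)
        new_centroids = [sum(c) / len(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:   # converged
            break
        centroids = new_centroids
    return clusters

values = [521, 8295, 446, 3259, 2952, 2162, 333, 46037, 44729, 1017,
          2239, 4600, 38996, 38457, 10200, 1401, 1022, 1845, 11807,
          5685, 4116, 10479, 43734, 47440, 1042]
for cluster in kmeans_1d(values, 5):
    print(cluster)
```

The cluster boundaries it finds can then be read off as the range limits of a hierarchy like the one in the question.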


Jenks Natural Breaks is a very efficient single dimension clustering scheme: http://www.spatialanalysisonline.com/OUTPUT/html/Univariateclassificationschemes.html#_Ref116892931

As comments have noted, this is very similar to k-means. However, I've found it even easier to implement, particularly the variation found in Borden Dent's Cartography: http://www.amazon.com/Cartography-Thematic-Borden-D-Dent/dp/0697384950


I think you're looking for something akin to data discretization, which is fairly common in AI for converting continuous data (or discrete data with so many classes as to be unwieldy) into discrete classes.

I know Weka uses Fayyad & Irani's MDL method as well as Kononenko's MDL method; I'll see if I can dig up some references.


This is only a one-dimensional problem, so there may be a dynamic programming solution. Assume that it makes sense to take the points in sorted order and then make n-1 cuts to generate n clusters, and that you can write down a penalty function f() for each cluster, such as the variance within the cluster or the distance between its min and max. You can then minimise the sum of f() evaluated at each cluster. Work one point at a time, from left to right. At each point P, for each cluster count c from 1 up to the target number of clusters, work out the best way to split the points seen so far into c clusters, and store the cost of that answer and the location of its rightmost split. To compute this for point P and cluster count c, consider every possible cut to the left of P: for each cut, add f() evaluated on the points to the right of the cut to the stored cost of the best (c-1)-cluster solution at the point just left of the cut. Once you have worked your way to the far right, do the same trick once more to get the best answer for the full cluster count, and use the stored rightmost-split locations to recover all the splits that give that best answer.

This might actually be more expensive than a k-means variant, but it has the advantage of guaranteeing a globally optimal answer (for your chosen f(), under these assumptions).
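The procedure above can be sketched as follows (my illustration; f() is taken here to be the within-cluster sum of squared deviations, but any cost over a contiguous run of sorted points slots in the same way):

```python
def cluster_cost(data, i, j):
    """f() for sorted data[i:j]: sum of squared deviations from the mean."""
    seg = data[i:j]
    mean = sum(seg) / len(seg)
    return sum((v - mean) ** 2 for v in seg)

def optimal_clusters(values, k):
    """Split sorted values into k contiguous clusters minimising total f()."""
    data = sorted(values)
    n = len(data)
    INF = float('inf')
    # best[c][p] = minimal cost of splitting data[:p] into c clusters;
    # cut[c][p]  = position of the rightmost split achieving that cost.
    best = [[INF] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for p in range(c, n + 1):
            for q in range(c - 1, p):   # q = cut position left of p
                cost = best[c - 1][q] + cluster_cost(data, q, p)
                if cost < best[c][p]:
                    best[c][p] = cost
                    cut[c][p] = q
    # Walk the stored rightmost-split positions back to recover the clusters.
    bounds, p = [], n
    for c in range(k, 0, -1):
        q = cut[c][p]
        bounds.append((q, p))
        p = q
    return [data[a:b] for a, b in reversed(bounds)]

print(optimal_clusters([1, 2, 10, 11, 20], 3))  # [[1, 2], [10, 11], [20]]
```

This runs in roughly O(k·n²) table entries (times the cost of evaluating f()), which is trivial for a dataset of 25 countries.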


Genetic hierarchical clustering algorithm


I was wondering.

Apparently what you're looking for is clean breaks. So before launching into complicated algorithms, you might consider a differential approach.

Take the sorted values:

[1, 1.2, 4, 5, 10]

The relative increases between consecutive values are:

[20%, 233%, 25%, 100%]

Now depending on the number of breaks we are looking for, it's a matter of selecting them:

2 categories: [1, 1.2] + [4, 5, 10]
3 categories: [1, 1.2] + [4, 5] + [10]

I don't know about you, but it feels natural to me, and you can even use a threshold approach, saying that a variation of less than x% is not worth a cut.

For example, four categories here would not make much sense.
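A minimal sketch of this percentage-jump idea (function name mine): sort the values, compute the relative increase between neighbours, and cut after the largest jumps.

```python
# Split sorted values into n_categories groups by cutting at the
# (n_categories - 1) largest relative jumps between neighbours.
def split_on_jumps(values, n_categories):
    data = sorted(values)
    # Relative increase between consecutive values, e.g. 1 -> 1.2 is 20%.
    jumps = [(data[i + 1] - data[i]) / data[i] for i in range(len(data) - 1)]
    # Indices of the largest jumps, restored to left-to-right order.
    cut_positions = sorted(sorted(range(len(jumps)), key=lambda i: jumps[i],
                                  reverse=True)[:n_categories - 1])
    clusters, start = [], 0
    for pos in cut_positions:
        clusters.append(data[start:pos + 1])
        start = pos + 1
    clusters.append(data[start:])
    return clusters

print(split_on_jumps([1, 1.2, 4, 5, 10], 2))  # [[1, 1.2], [4, 5, 10]]
print(split_on_jumps([1, 1.2, 4, 5, 10], 3))  # [[1, 1.2], [4, 5], [10]]
```

Adding the threshold variant is just a matter of filtering `jumps` before selecting the cuts.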
