I have a couple of numerical datasets that I need to create a concept hierarchy for. For now, I have been doing this manually by observing the 开发者_开发技巧data (and a corresponding linechart). Based on my intuition, I created some acceptable hierarchies.
This seems like a task that can be automated. Does anyone know if there is an algorithm to generate a concept hierarchy for numerical data?
To give an example, I have the following dataset:
Bangladesh 521
Brazil 8295
Burma 446
China 3259
Congo 2952
Egypt 2162
Ethiopia 333
France 46037
Germany 44729
India 1017
Indonesia 2239
Iran 4600
Italy 38996
Japan 38457
Mexico 10200
Nigeria 1401
Pakistan 1022
Philippines 1845
Russia 11807
South Africa 5685
Thailand 4116
Turkey 10479
UK 43734
US 47440
Vietnam 1042
for which I created the following hierarchy:
- LOWEST ( < 1000)
- LOW (1000 - 2500)
- MEDIUM (2501 - 7500)
- HIGH (7501 - 30000)
- HIGHEST ( > 30000)
Maybe you need a clustering algorithm?
Quoting from the link:
Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields
Jenks Natural Breaks is a very efficient single dimension clustering scheme: http://www.spatialanalysisonline.com/OUTPUT/html/Univariateclassificationschemes.html#_Ref116892931
As comments have noted, this is very similar to k-means. However, I've found it even easier to implement, particularly the variation found in Borden Dent's Cartography: http://www.amazon.com/Cartography-Thematic-Borden-D-Dent/dp/0697384950
I think you're looking for something akin to data discretization that's fairly common in AI to convert continuous data (or discrete data with such a large number of classes as to be unwieldy) into discrete classes.
I know Weka uses Fayyad & Irani's MDL Method as well as Kononeko's MDL method, I'll see if I can dig up some references.
This is only a 1-dimensional problem, so there may be a dynamic programming solution. Assume that it makes sense to take the points in sorted order and then make n-1 cuts to generate n clusters. Assume that you can write down a penalty function f() for each cluster, such as the variance within the cluster, or the distance between min and max in the cluster. You can then minimise the sum of f() evaluated at each cluster. Work from one point at a time, from left to right. At each point, for 1..# clusters - 1, work out the best way to split the points so far into that many clusters, and store the cost of that answer and the location of its rightmost split. You can work this out for point P and cluster size c as follows: consider all possible cuts to the left of P. For each cut add f() evaluated on the group of points to the right of the cut to the (stored) cost of the best solution for cluster size c-1 at the point just to the left of the cut. Once you have worked your way to the far right, do the same trick once more to work out the best answer for cluster size c, and use the stored locations of rightmost splits to recover all the splits that give that best answer.
This might actually be more expensive than a k-means variant, but has the advantage of guaranting to find a global best answer (for your chosen f() under these assumptions).
Genetic hierarchical clustering algorithm
I was wondering.
Apparently what you are looking for are clean breaks. So before launching yourself into complicated algorithms, you may perhaps envision a differential approach.
[1, 1.2, 4, 5, 10]
[20%, 333%, 25%, 100%]
Now depending on the number of breaks we are looking for, it's a matter of selecting them:
2 categories: [1, 1.2] + [4, 5, 10]
3 categories: [1, 1.2] + [4, 5] + [10]
I don't know about you but it does feel natural in my opinion, and you can even use a treshold approach saying that a variation less than x%
is not worth considering a cut.
For example, here 4 categories
does not seem to make much sense.
精彩评论