What data do I need to implement k nearest neighbor?

https://www.devze.com 2023-03-09 16:52 Source: web
I currently have a reddit-clone type website. I'm trying to recommend posts based on the posts that my users have previously liked.

It seems like K nearest neighbor or k means are the best way to do this.

I can't seem to understand how to actually implement this. I've seen some mathematical formulas (such as the one on the k-means Wikipedia page), but they don't really make sense to me.

Could someone maybe recommend some pseudo code, or places to look so I can get a better feel on how to do this?


K-Nearest Neighbor (aka KNN) is a classification algorithm.

Basically, you take a training group of N items and classify them. How you classify them is completely dependent on your data, and what you think the important classification characteristics of that data are. In your example, this may be category of posts, who posted the item, who upvoted the item, etc.

Once this 'training' data has been classified, you can then evaluate an 'unknown' data point. You determine the 'class' of the unknown by locating the nearest neighbors to it in the classification system. If you determine the classification from the 3 nearest neighbors, it is called a 3-nearest-neighbor algorithm.

How you determine the 'nearest neighbor' depends heavily on how you classify your data. It is very common to plot the data into N-dimensional space where N represents the number of different classification characteristics you are examining.

A trivial example:

Let's say you have the longitude/latitude coordinates of a location that can be on any landmass anywhere in the world. Let us also assume that you do not have a map, but you do have a very large data set that gives you the longitude/latitude of many different cities in the world, and you also know which country those cities are in.

If I asked you which country a random longitude/latitude point is in, would you be able to figure it out? What would you do to figure it out?

Longitude/latitude data falls naturally into an X,Y graph. So, if you plotted out all the cities onto this graph, and then the unknown point, how would you figure out the country of the unknown? You might start drawing circles around that point, growing increasingly larger until the circle encompasses the 10 nearest cities on the plot. Now, you can look at the countries of those 10 cities. If all 10 are in the USA, then you can say with a fair degree of certainty that your unknown point is also in the USA. But if only 6 cities are in the USA, and the other 4 are in Canada, can you say where your unknown point is? You may still guess USA, but with less certainty.
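The circle-drawing idea above can be sketched in a few lines of Python. The city coordinates below are rough, made-up figures for illustration, and plain Euclidean distance on latitude/longitude is a crude stand-in for a real geodesic distance:

```python
import math
from collections import Counter

# Toy training data: (latitude, longitude) -> country. Coordinates are
# approximate and only for illustration.
cities = [
    ((40.7, -74.0), "USA"),      # New York
    ((34.1, -118.2), "USA"),     # Los Angeles
    ((41.9, -87.6), "USA"),      # Chicago
    ((43.7, -79.4), "Canada"),   # Toronto
    ((45.5, -73.6), "Canada"),   # Montreal
    ((49.3, -123.1), "Canada"),  # Vancouver
]

def classify(point, data, k):
    """Return (majority_country, certainty) among the k nearest cities."""
    # Sort all labeled points by distance and keep the k closest.
    nearest = sorted(data, key=lambda item: math.dist(point, item[0]))[:k]
    votes = Counter(country for _, country in nearest)
    country, count = votes.most_common(1)[0]
    return country, count / k

print(classify((42.3, -83.0), cities, k=3))  # a point near Detroit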

The toughest part of KNN is figuring out how to classify your data in a way that you can determine 'neighbors' of similar quality, and the distance to those neighbors.


What you described sounds like a recommender-system engine, not a clustering algorithm like k-means, which is in essence an unsupervised approach. I can't say for certain what reddit actually uses, but googling around "recommender + reddit" turns up some interesting posts, e.g. Reddit, Stumbleupon, Del.icio.us and Hacker News Algorithms Exposed! In any case, the k-NN algorithm (described in the top ten data mining algorithms, with pseudocode on Wikipedia) might be used, as might other techniques like collaborative filtering (used by Amazon, for example), described in this good tutorial.
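To make the collaborative-filtering idea concrete, here is a minimal item-based sketch in Python. All of the user names and post ids are made up; a post is treated as the set of users who liked it, and two posts are similar when their fan sets overlap (cosine similarity on the like vectors):

```python
from math import sqrt

# Made-up like data: user -> set of liked post ids.
likes = {
    "alice": {"p1", "p2", "p3"},
    "bob":   {"p1", "p3", "p4"},
    "carol": {"p2", "p4", "p5"},
}

def item_similarity(a, b, likes):
    """Cosine similarity between two posts, each viewed as the set of
    users who liked it."""
    fans_a = {u for u, liked in likes.items() if a in liked}
    fans_b = {u for u, liked in likes.items() if b in liked}
    if not fans_a or not fans_b:
        return 0.0
    return len(fans_a & fans_b) / sqrt(len(fans_a) * len(fans_b))

def recommend(user, likes, top_n=2):
    """Score each post the user hasn't seen by its total similarity to
    the posts the user already liked, highest first."""
    seen = likes[user]
    candidates = {p for liked in likes.values() for p in liked} - seen
    scores = {p: sum(item_similarity(p, q, likes) for q in seen)
              for p in candidates}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice", likes))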


k-Means clustering in its simplest form averages values and groups them around central average values. Suppose you have the following values

1,2,3,4,6,7,8,9,10,11,12,21,22,33,40

Now if I do k-means clustering, and remember that k-means has a biasing (averaging) mechanism that pulls each value toward its nearest cluster center, we might get the following.

cluster-1 
1,2,3,4,6,7,8,9

cluster-2
10,11,12

cluster-3
21,22

cluster-4
33

cluster-5
40

Remember, I just made up these cluster centers (clusters 1-5). So the next time you do clustering, the numbers will end up around one of these central means (also known as k-centers). The data above is one-dimensional.
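The standard way to compute such centers is Lloyd's algorithm: repeatedly assign each value to its nearest center, then move each center to the mean of its cluster. A one-dimensional sketch in Python, run on the values above (the initial centers are an arbitrary guess, and the final grouping need not match the hand-made clusters exactly):

```python
def kmeans_1d(values, centers, iterations=10):
    """Plain Lloyd's algorithm in one dimension."""
    for _ in range(iterations):
        # Assignment step: each value joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            i = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[i].append(v)
        # Update step: move each non-empty center to the mean of its cluster.
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(clusters, centers)]
    return centers, clusters

values = [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 21, 22, 33, 40]
centers, clusters = kmeans_1d(values, centers=[5, 11, 21, 33, 40])
print(clusters)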

When you perform k-means clustering on large, multi-dimensional data sets (multi-dimensional data here means an array of values, and you may have millions of arrays of the same dimension), you will need something bigger and more scalable. You would first average one array down to a single value, repeat the same for the other arrays, and then perform k-means clustering on the results.


Hope this helps.


To do k-nearest neighbors you mostly need a notion of distance and an affordable way of finding the k nearest neighbors to a point (you probably don't want to search through all your data points one by one). There is a library for approximate nearest neighbor search at http://www.cs.umd.edu/~mount/ANN/. It's a very simple classification algorithm: to classify a new point p, find its k nearest neighbors and assign p the most popular class amongst those k neighbors.
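Before reaching for an approximate-nearest-neighbor library, the brute-force baseline is worth seeing, since it makes the two ingredients explicit: a pluggable distance function and a "k closest" selection. A small Python sketch on made-up 2-D points:

```python
import heapq
from collections import Counter

def knn_classify(point, data, k, distance):
    """Naive k-NN: scan every labeled point, keep the k closest,
    and return the majority class. O(n) per query; fine for small
    data, which is exactly what ANN-style libraries speed up."""
    nearest = heapq.nsmallest(k, data, key=lambda item: distance(point, item[0]))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Toy labeled points: two well-separated groups.
data = [((0, 0), "blue"), ((1, 0), "blue"), ((0, 1), "blue"),
        ((5, 5), "red"), ((6, 5), "red"), ((5, 6), "red")]

print(knn_classify((1, 1), data, k=3, distance=euclidean))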

I guess in your case, once you decide what "nearest" means, you could show users a list of similar posts, then monitor the click-through from that list and try to learn from it to predict which of those alternatives would be most popular.

If you are interested in finding a particularly good learning algorithm for your purposes, have a look at http://www.cs.waikato.ac.nz/ml/weka/ - it allows you to try out a large number of different algorithms, and also to write your own as plug-ins.


Here is a very simple example of KNN for the MNIST dataset. Once you are able to calculate the distance between your documents, the same algorithm will work:

http://shyamalapriya.github.io/digit-recognition-using-k-nearest-neighbors/
